【Question title】: Find text including umlaut with xpath in R
【Posted】: 2021-05-04 08:43:59
【Question】:

I want to use text() to identify nodes whose text contains umlauts ("Umlaute").

library(xml2)
library(rvest)
doc <- "<p>Über uns </p>" %>% xml2::read_html()
grepl(pattern = "Über uns", x = as.character(doc))
grepl(pattern = "Über uns", x = doc)

Question:

How can I extract the nodes that contain the text "Über uns"?

What I tried:

https://forum.fhem.de/index.php?topic=96254.0

Java XPath umlaut/vowel parsing

# does not work
xp <- paste0("//*[contains(text(), 'Über uns')]")
html_nodes(x = doc, xpath = xp)

# does not work    
xp <- paste0("//*[translate(text(), 'Ü', 'U') = 'Uber uns']")
html_nodes(x = doc, xpath = xp)

# does not work
xp <- paste0("//*[contains(text(), '&Uuml;ber uns')]")
html_nodes(x = doc, xpath = xp)


# this works, but I wonder whether there is a solution that uses only xpath
doc2 <- doc %>% 
  as.character() %>% 
  gsub(pattern = "Ü", replacement = "Ue") %>% 
  xml2::read_html()

xp <- paste0("//*[contains(text(), 'Ueber uns')]")
html_nodes(x = doc2, xpath = xp)

【Comments】:

    Tags: r xpath rvest xml2


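    【Solution 1】:

    This sounds like an encoding problem; the query works under en_US.UTF-8. Consider changing your default text encoding to UTF-8 (for example in RStudio: Tools > Global Options > Code > Saving > Default text encoding), or switch the locale temporarily: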

    library(xml2)
    library(rvest)
    old.locale <- Sys.getlocale("LC_CTYPE")
    Sys.setlocale("LC_CTYPE", 'C') # using non-UTF-8 encoding
    #> [1] "C"
    doc <- "<p>Über uns </p>" %>% xml2::read_html()
    xp <- paste0("//*[contains(text(), 'Über uns')]")
    html_nodes(x = doc, xpath = xp)
    #> {xml_nodeset (0)}
    
    Sys.setlocale("LC_CTYPE", 'en_US.UTF-8') # using UTF-8 encoding
    #> [1] "en_US.UTF-8"
    
    doc <- "<p>Über uns </p>" %>% xml2::read_html()
    xp <- paste0("//*[contains(text(), 'Über uns')]")
    html_nodes(x = doc, xpath = xp)
    #> {xml_nodeset (1)}
    #> [1] <p>Über uns </p>
    
    Sys.setlocale("LC_CTYPE", old.locale)
    #> [1] "en_US.UTF-8"
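
If changing the locale is not an option, a variation on the same idea is to keep the script itself ASCII-only. The following is a minimal sketch, not part of the original answer: it assumes that writing "Ü" as the Unicode escape `\u00dc` and converting the query with `enc2utf8()` is enough for xml2 to receive UTF-8 on both sides. The names `needle`, `hits_xpath`, and `hits_r` are illustrative, and the behaviour has not been checked under every locale.

```r
library(xml2)

# "\u00dc" is the Unicode escape for "Ü". With escapes the script contains
# only ASCII, so the result should not depend on how the file was saved.
needle <- "\u00dcber uns"
doc <- read_html(paste0("<p>", needle, " </p>"))

# Option 1: XPath, with the query explicitly converted to UTF-8.
xp <- enc2utf8(paste0("//*[contains(text(), '", needle, "')]"))
hits_xpath <- xml_find_all(doc, xp)

# Option 2 (fallback without a non-ASCII XPath): select candidate nodes,
# then compare their text in R using a fixed (non-regex) match.
candidates <- xml_find_all(doc, "//p")
hits_r <- candidates[grepl(needle, xml_text(candidates), fixed = TRUE)]

print(hits_xpath)
print(hits_r)
```

Option 2 avoids passing non-ASCII characters through XPath at all, at the cost of filtering in R rather than in the query.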
    

    【Discussion】:
