【Question title】: Extract all text & tags between two heading tags (<h3>) with rvest
【Posted】: 2023-03-20 11:05:01
【Question description】:

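This page shows six sections, each listing people between <h3> tags.

How can I select these six sections separately with XPath (using rvest), perhaps into a nested list? My goal is to later lapply over the six sections to get each person's name and affiliation, separated by section.

The HTML is not well structured, i.e. not every piece of text sits inside its own tag. An example: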

<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />

I access the site with the following code:

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
                  httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))

webpage <- rvest::html_nodes(webpage, css = '#editorialboard')

I tried various XPath expressions with html_nodes to extract the six sections into a nested list of six lists, but none of them works properly:

# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')

# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')

# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')

# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')

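【Question discussion】:

    Tags: r web-scraping rvest


    【Solution 1】:

    Would you accept an ugly solution that does not use XPath? I don't think you can get a nested list from the structure of this site... but I am not very experienced with XPath.

    I first get the headings, split the raw text on the heading names, and then, within each group, split the members using '\n' as the separator.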

    library(rvest)  # also provides the %>% pipe used below

    journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
    
    webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
    
    # get h3 headings
    headings <- webpage %>% html_nodes('h3') %>% html_text()
    
    # get raw text
    raw.text <- webpage %>% html_text()
    
    # split raw text on h3 headings and put in a list
    list.members <- list()
    raw.text.2 <- raw.text
    for (h in headings) {
      # split on headings
      b <- strsplit(raw.text.2, h, fixed=TRUE)
      # split members using \n as separator
      c <- strsplit(b[[1]][1], '\n', fixed=TRUE)
      # clean empty elements from vector
      c <- list(c[[1]][c[[1]] != ""])
      # add vector of member to list
      list.members <- c(list.members, c)
      # update text
      raw.text.2 <- b[[1]][2]
    }
    # remove first element of main list
    list.members <- list.members[2:length(list.members)]
    # add final segment of raw.text to list
    c <- strsplit(raw.text.2, '\n', fixed=TRUE)
    c <- list(c[[1]][c[[1]] != ""])
    list.members <- c(list.members, c)
    # add names to list
    names(list.members) <- headings
    

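    You then get a list of groups; each element of the list is a character vector with one string per member (containing all of that member's information).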

    > list.members$`Editor-in-Chief`
    [1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
    > list.members$`Clinical Engineering`
     [1] "William R. Clark – Purdue University, West Lafayette, IN, USA"                     
     [2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"                  
     [3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"                             
     [4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"            
     [5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy" 
     [6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"                       
     [7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"                 
     [8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
     [9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"    
    [10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"           
    [11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"    
    [12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
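
The question's stated goal is to lapply over the sections and pull out each person's name and affiliation. A minimal base-R sketch of that step follows, assuming every entry has the "Name – Affiliation" shape seen in the output above. `split_member` is a hypothetical helper, and the small `list.members` here is copied from that output so the block runs on its own.

```r
# Two sections copied from the output above; in practice `list.members`
# comes from the answer's code. "\u2013" is the en dash between name and
# affiliation, written as an escape to avoid encoding problems.
list.members <- list(
  `Editor-in-Chief` = "Claudio Ronco \u2013 St. Bartolo Hospital, Vicenza, Italy",
  `Clinical Engineering` = c(
    "William R. Clark \u2013 Purdue University, West Lafayette, IN, USA",
    "Hideyuki Kawanashi \u2013 Tsuchiya General Hospital, Hiroshima, Japan"
  )
)

# Hypothetical helper: split each string at the FIRST " – " only, so any
# later dash inside the affiliation is left alone.
split_member <- function(x) {
  parts <- regmatches(x, regexpr(" \u2013 ", x, fixed = TRUE), invert = TRUE)
  data.frame(
    name        = trimws(vapply(parts, `[`, character(1), 1)),
    affiliation = trimws(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# One data frame per section, keeping the section names.
members <- lapply(list.members, split_member)
```

Entries without a dash would end up with `NA` in the affiliation column, which makes irregular lines easy to spot.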
    

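    【Discussion】:

    • Great, that works well, thanks! And yes, I have no problem with an "ugly" solution. I think that is unavoidable in web scraping :-)
    • Awesome, you're welcome! I noticed a small typo in my code (where I add the final segment of raw.text to the list). I corrected it, but it did not change the result.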
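
For comparison, a possible alternative that works on the nodes rather than on the raw text: select every direct child of the container (text nodes included) with the XPath `./node()`, then group the nodes by how many <h3> headings precede them. This is only a sketch run against the sample markup from the question, not against the live page. It assumes, like the answer above, that each person's line ends in a newline after `<br />`. The names `board`, `nodes`, `group` and `sections` are illustrative.

```r
library(rvest)

# Sample markup copied from the question, wrapped in a container div.
html <- '<div id="editorialboard">
<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
</div>'

board <- html_node(read_html(html), css = "#editorialboard")

# All direct children of the container: elements AND bare text nodes.
nodes <- html_nodes(board, xpath = "./node()")

# Each <h3> starts a new group; cumsum() turns that into a group id.
is_h3 <- html_name(nodes) == "h3"
group <- cumsum(is_h3)

sections <- lapply(seq_len(sum(is_h3)), function(i) {
  # Concatenate the text of everything in group i except the heading.
  txt <- paste(html_text(nodes[group == i & !is_h3]), collapse = "")
  # One person per line; drop empty lines left by the <br /> tags.
  lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
  lines[lines != ""]
})
names(sections) <- html_text(nodes[is_h3])
```

The result has the same shape as `list.members` in the answer: a named list with one character vector per heading.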