Extract all text and tags between two heading tags (<h3>) with rvest

【Question title】: Extract all text & tags between two heading tags (<h3>) with rvest
【Posted】: 2023-03-20 11:05:01
【Question description】:

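This page shows six sections, each listing people between <h3> tags.

How can I select these six sections separately with XPath (using rvest), perhaps into a nested list? My goal is to later lapply over the six sections to get each person's name and affiliation (separated by section).

The HTML is not well structured, i.e. not every piece of text sits inside its own tag. An example: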

<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />

I access the site with the following code:

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
                  httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))

webpage <- rvest::html_nodes(webpage, css = '#editorialboard')

I tried various XPath expressions with html_nodes to extract the six sections into a nested list of six lists, but none of them works properly:

# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')

# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')

# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')

# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')

【Question comments】:

Tags: r web-scraping rvest

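【Solution 1】:

Would you accept an ugly solution that does not use XPath? I don't think you can get a nested list out of this site's structure... but I am not very experienced with XPath.

I first get the headings, split the raw text on the heading names, and then, within each group, split the members using '\n' as the separator.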

library(rvest)  # provides read_html(), html_node(), html_nodes(), html_text() and %>%

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"

webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')

# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()

# get raw text
raw.text <- webpage %>% html_text()

# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
  # split on headings
  b <- strsplit(raw.text.2, h, fixed=TRUE)
  # split members using \n as separator
  c <- strsplit(b[[1]][1], '\n', fixed=TRUE)
  # clean empty elements from vector
  c <- list(c[[1]][c[[1]] != ""])
  # add vector of member to list
  list.members <- c(list.members, c)
  # update text
  raw.text.2 <- b[[1]][2]
}
# remove first element of main list
list.members <- list.members[2:length(list.members)]
# add final segment of raw.text to list
c <- strsplit(raw.text.2, '\n', fixed=TRUE)
c <- list(c[[1]][c[[1]] != ""])
list.members <- c(list.members, c)
# add names to list
names(list.members) <- headings

You then get a list of groups; each element of the list is a character vector with one string per member (containing all of the information):

> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
 [1] "William R. Clark – Purdue University, West Lafayette, IN, USA"                     
 [2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"                  
 [3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"                             
 [4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"            
 [5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy" 
 [6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"                       
 [7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"                 
 [8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
 [9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"    
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"           
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"    
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
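
For comparison, here is a minimal XPath-based sketch of the same grouping idea. It is not the answerer's method and has not been run against the live page: it works on an inline copy of the fragment quoted in the question, wrapped in a hypothetical <div id="editorialboard">, and assumes the <h3> tags and the entries are direct children of that element. For the i-th heading it selects every child node that has exactly i <h3> siblings before it; the names `section_members`, `sections` and `member_names` are made up for illustration.

```r
library(xml2)

# Inline copy of the fragment quoted in the question, wrapped in a
# hypothetical <div id="editorialboard"> container (assumption: on the real
# page the <h3> tags and the entries are direct children of that element).
html <- '<div id="editorialboard">
<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
</div>'

board <- xml_find_first(read_html(html), "//*[@id='editorialboard']")
headings <- xml_text(xml_find_all(board, "./h3"))

# For the i-th heading, keep every child node (text or element) that is not
# itself an <h3> and has exactly i <h3> siblings before it.
section_members <- function(i) {
  xpath <- sprintf("./node()[not(self::h3)][count(preceding-sibling::h3) = %d]", i)
  nodes <- xml_find_all(board, xpath)
  # Treat <br> as a line break and take the text of everything else.
  pieces <- ifelse(xml_name(nodes) == "br", "\n", xml_text(nodes))
  lines <- trimws(strsplit(paste(pieces, collapse = ""), "\n", fixed = TRUE)[[1]])
  lines[nzchar(lines)]
}

sections <- lapply(seq_along(headings), section_members)
names(sections) <- headings

# Name = everything before the en dash (U+2013) that precedes the affiliation.
member_names <- lapply(sections, function(x) sub("\\s*\u2013.*$", "", x))
```

Because the text of the <i> elements is pasted back together with the surrounding text nodes, each member stays on one line, which is what the separate text() and * queries in the question could not do. If the real page nests the entries more deeply, the "./node()" step would need adjusting.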

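【Discussion】:

Great, that works well, thanks! Yes, I have no problem with "ugly" solutions; I think they are unavoidable in web scraping :-)
Great! You're welcome. I noticed a small typo in my code (where I add the final segment of raw.text to the list). I corrected it, but it does not change the result.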