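【Posted】: 2018-11-25 14:37:44
【Question】:
After spending a lot of time on this problem and reading through the available answers, I am posting a new question about an issue I have with web scraping in R using rvest. I have tried to lay the problem out in full to minimize follow-up questions.
Problem
I am trying to extract author names from a conference web page. The authors are split across pages alphabetically by last name, so I need a for loop that calls follow_link() 25 times to visit each page and extract the relevant author text.
Conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html
I have tried two solutions in R using rvest, and both have problems.
Solution 1 (calling links by letter)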
library(rvest)

lttrs <- LETTERS  # character vector "A" to "Z"
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
tempList <- list()  # list to store each page's author information
for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>%  # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
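This code works, but only up to a point. The output is below. It navigates the letter pages correctly until the H-to-I and L-to-M transitions, where it fetches the wrong page.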
Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
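One possible explanation for the jumps at I and M: when follow_link() is given a string, the rvest documentation describes it as selecting the first link containing that text (case-sensitive), not the link whose text equals it. The sketch below mimics that rule in base R. The link texts are hypothetical, since the question does not show the page's actual link texts, and `first_containing()` and `first_exact()` are illustrative helpers, not rvest functions.

```r
# Hypothetical link texts in page order (assumed for illustration only).
link_text <- c("Meeting site", "Index of authors", "A", "B", "H", "I", "L", "M")

# Index of the first link whose text CONTAINS `pattern`
# (literal, case-sensitive), mimicking follow_link("I").
first_containing <- function(pattern, texts) {
  which(grepl(pattern, texts, fixed = TRUE))[1]
}

# Index of the first link whose text EQUALS `pattern`.
first_exact <- function(pattern, texts) {
  which(texts == pattern)[1]
}

link_text[first_containing("H", link_text)]  # "H"
link_text[first_containing("I", link_text)]  # "Index of authors"
link_text[first_containing("M", link_text)]  # "Meeting site"
link_text[first_exact("I", link_text)]       # "I"
```

If the real page has earlier links whose text contains a capital I or M, a substring match would land on them first, which would be consistent with the log above.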
Solution 2 (calling links by CSS)
Using a CSS selector tool on the page, each letter link is identified as "a:nth-child(1-26)". So I rebuilt my loop to call that CSS identifier.
tempList <- list()
for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste0("a:nth-child(", i, ")")) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
This sort of works. Again, it has trouble with certain transitions (see below).
Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html
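A possible reason for the stray pages: the selector a:nth-child(i) is not limited to the letter bar. It matches every a element that is the i-th child of its own parent, anywhere in the document, and follow_link() appears to follow the first node the selector returns. The sketch below shows the difference between an unscoped and a scoped selector on hypothetical markup. The `#nav` and `#letters` ids are assumptions for illustration and are not taken from the conference page.

```r
library(rvest)  # provides html_nodes(), html_attr() and the %>% pipe

# Hypothetical markup: an unrelated navigation block followed by a letter bar.
page <- xml2::read_html('<html><body>
  <div id="nav">
    <a href="home.html">Home</a>
    <a href="uploadlistall.html">Uploads</a>
  </div>
  <div id="letters">
    <a href="authora.html">A</a>
    <a href="authorb.html">B</a>
    <a href="authorc.html">C</a>
  </div>
</body></html>')

# Unscoped: every <a> that is the 2nd child of its own parent.
unscoped <- page %>% html_nodes("a:nth-child(2)") %>% html_attr("href")

# Scoped to the letter bar: only the 2nd letter link.
scoped <- page %>% html_nodes("#letters a:nth-child(2)") %>% html_attr("href")

unscoped  # "uploadlistall.html" "authorb.html"
scoped    # "authorb.html"
```

Scoping the selector to whatever container actually holds the letter links on the real page would be the analogous change. That container would have to be identified by inspecting the page source.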
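Specifically, this method misses B, C, and D; at those steps the loop goes to the wrong pages. I would appreciate any insight or guidance on how to reconfigure the code above so that it correctly loops through all 26 letter pages.
Thank you very much!
【Discussion】:
Tags: css r web-scraping rvest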
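Since the navigation logs above show every letter page following the pattern author<letter>.html, one way to sidestep follow_link() entirely might be to build the 26 URLs directly. This is a minimal sketch, assuming the pattern holds for all letters. `scrape_authors()` is a hypothetical helper and is only defined here, not called, so the sketch runs without network access.

```r
# Base URL taken from the question. The author<letter>.html pattern is an
# assumption inferred from the "Navigating to ..." logs.
base_url <- "https://gsa.confex.com/gsa/2016AM/webprogram/"
page_urls <- paste0(base_url, "author", letters, ".html")
names(page_urls) <- LETTERS

# Hypothetical helper: read one page and return the text of all nodes with
# class="author". Needs the xml2 and rvest packages when it is called.
scrape_authors <- function(url) {
  page <- xml2::read_html(url)
  rvest::html_text(rvest::html_nodes(page, xpath = '//*[@class = "author"]'))
}

head(page_urls, 2)
```

Calling `author_list <- lapply(page_urls, scrape_authors)` would then give a named list with one element per letter, provided each of the 26 pages exists.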