Scraping frames in R without RSelenium? Answers

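【Question title】: Scraping frames in R without RSelenium?
【Posted】: 2021-09-13 15:06:10
【Question description】:

After clicking "Information" on this page, I need to scrape the "manuscript received" date that is visible in the right-hand frame: https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717. I tried the rvest script listed below, which worked well in similar cases. Here, however, it does not work, probably because a click is required to reach the publication history. I tried to work around this by appending #pane-pcw-details to the URL (https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details), but to no avail. Another option would be RSelenium, but perhaps there is a simpler workaround?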

library(rvest)

link <- c("https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details")
wiley_output <- data.frame()

page <- read_html(link)
revhist <- page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()
wiley_output <- rbind(wiley_output, data.frame(link, revhist, stringsAsFactors = FALSE))

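【Question comments】:

Tags: r web-scraping rvest

【Solution 1】:

That data comes from an ajax call, which you can find in the Network tab of the browser's developer tools. The request has many query string parameters, but you actually only need the identifier of the document (the doi), along with ajax=true to ensure that the data associated with the specified ajax action is returned: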

https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717

library(rvest)
library(magrittr)

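# Request the ajax endpoint directly; only ajax=true and the doi are needed
link <- 'https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717'
page <- read_html(link)
# Extract the fifth entry of the publication history list
page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()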
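
If several articles need to be processed, the same request can be wrapped in a small function that returns a data frame, much like the wiley_output table in the question. The following is only a minimal sketch and not part of the original answer: the names build_pubinfo_url, extract_history_entry and scrape_wiley_history are hypothetical, and it assumes that the wanted entry is the fifth item of the publication history for every article, which may not hold in general.

```r
library(rvest)

# Build the ajax URL for one DOI (endpoint and parameters as in the answer above).
build_pubinfo_url <- function(doi) {
  paste0("https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=", doi)
}

# Extract one publication-history entry from a parsed page.
# Assumption: the default selector (fifth list item) fits every article.
# Returns NA if the selector matches nothing.
extract_history_entry <- function(page,
                                  selector = ".publication-history li:nth-child(5)") {
  page %>% html_node(selector) %>% html_text(trim = TRUE)
}

# Hypothetical helper: fetch the entry for each DOI.
# A failed request yields NA instead of stopping the loop.
scrape_wiley_history <- function(dois) {
  revhist <- vapply(dois, function(doi) {
    tryCatch(
      extract_history_entry(read_html(build_pubinfo_url(doi))),
      error = function(e) NA_character_
    )
  }, character(1), USE.NAMES = FALSE)
  data.frame(doi = dois, revhist = revhist, stringsAsFactors = FALSE)
}

# Usage (performs a network request):
# wiley_output <- scrape_wiley_history("10.1002/jcc.26717")
```

Passing the selector as an argument makes it possible to adjust it if the entry sits at a different position for another article.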

【Comments】: