How to access a page scraped using RSelenium with rvest?

【Question title】: How to access a page scraped using RSelenium with rvest?
【Posted】: 2018-02-11 15:33:00
【Question】:

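I am trying to scrape a web page that uses angular.js. My understanding is that the only option in R is to load the page with RSelenium first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing content, so I would like to use RSelenium as little as possible and switch to rvest as soon as I can.

So far, I have figured out that I probably need RSelenium at least to connect, and htmlTreeParse to download the HTML. Suppose this is part of my output: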

structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

How can I pass this to rvest::read_html()?

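【Question comments】:

I suspect you need to bypass read_html rather than feed it. The purpose of read_html is to download the data so that subsequent functions (e.g. html_nodes) can work on it. Unfortunately, a brief inspection of read_html's output suggests this is not trivial, because the output contains no actual data, only pointers. It is probably possible, but reverse engineering it would be much harder. Perhaps you should consider using xml2 directly rather than going through rvest?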

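Tags: r web-scraping html-parsing rvest rselenium

【Solution 1】:

If you look at the class of the item, it is an XMLNode, a class defined by the XML package. That package defines a method for toString (but, oddly, not for as.character) that lets you convert the node to an ordinary string, which xml2::read_html can in turn read: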

library(rvest)
#> Loading required package: xml2

node <- structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

node %>% XML::toString.XMLNode() %>% read_html()
#> {xml_document}
#> <html>
#> [1] <body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6 ...
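Once the node has been read into an xml_document, the usual rvest selectors apply to it. A minimal sketch, assuming the hand-written HTML string below matches what toString yields for the node above (it is typed out here only so the example runs without the XML package; the real output may differ in whitespace):

```r
library(rvest)

# Hand-written stand-in for the string produced by toString() on the node
# above (an assumption made so this runs without the XML package).
html <- '<div class="im_dialog_date" ng-bind="dialogMessage.dateText">6:52 PM</div>'

doc <- read_html(html)

# Select the div by its CSS class, then pull out its text and one attribute.
date_node <- html_node(doc, "div.im_dialog_date")
date_text <- html_text(date_node)
binding   <- html_attr(date_node, "ng-bind")
```

Here date_text holds the visible timestamp and binding holds the value of the ng-bind attribute.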

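That said, I usually just use the getPageSource() method of RSelenium::remoteDriver to grab all of the HTML, which is then easy to parse with rvest.

【Comments】:

Yes! Just drive your remote driver to the page you want (running JavaScript, logging in, submitting forms, clicking buttons, etc.), then grab the page source directly instead of trying to select nodes in RSelenium.
I was getting an error, and I just realized the reason is that I need to pass the contents of the list, hence read_html(remDr$getPageSource()[[1]])
Another option is @hrbrmstr's new splashr package, which pipes nicely: its render_html and splash_html functions return HTML that has already been read by xml2::read_html.
Ah yes, I forgot what the output looks like, sorry.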
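The getPageSource() approach might be sketched as follows. getPageSource() returns a list of length one, so its first element has to be extracted before handing it to read_html. The helper name read_driver_page and the fake_driver object are hypothetical: the latter is a stand-in that mimics only the getPageSource() method, so the example runs without a Selenium server.

```r
library(rvest)

# Parse whatever page the driver currently has loaded.
# getPageSource() returns a list of length one, so take its first element.
read_driver_page <- function(remDr) {
  read_html(remDr$getPageSource()[[1]])
}

# Hypothetical stand-in for a live remoteDriver. With a real session, pass
# the object created by RSelenium::remoteDriver() after calling
# remDr$open() and remDr$navigate(url) instead.
fake_driver <- list(
  getPageSource = function() {
    list('<html><body><div class="im_dialog_date">6:52 PM</div></body></html>')
  }
)

doc <- read_driver_page(fake_driver)

# From here on, everything is plain rvest.
dates <- html_text(html_nodes(doc, "div.im_dialog_date"))
```

Keeping the RSelenium part down to a single call like this leaves all node selection to rvest, which is the split the question asks for.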