R: webscraping a table

【Question title】: R: webscraping a table
【Posted】: 2016-05-09 09:16:16
【Question】:

I am trying to scrape a web page of live exchange rates. I tried:

library(XML)
webpage  <- "http://liveindex.org/"

tables <- readHTMLTable(webpage )
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

But I get an error message.

Thanks for your help.

【Comments】:

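Hi, I'm not an expert, but the page doesn't actually appear to be XML. For comparison, you can look at the ECB website, which publishes its rates in XML format. If you're interested, I can share code for getting the rates from there. On the topic of exchange rates, I recommend this question.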
I don't know whether you can do this for tick-by-tick data, but you could start from here: reviews <- webpage %>% read_html() %>% html_nodes("#menu_content .inline_rates_container"). If I try to extract the values, I get NA.
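"You may not use any computerized or automatic mechanism, including but not limited to any web scraper, spider, or robot, to access, extract, and/or download any information, including but not limited to any currency exchange data, from the Site or the Tools"
Thank you very much for the information. I have changed the link.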

Tags: r

【Solution 1】:

I was able to extract the contents of the table as a character vector (note: I used Windows for this example).

library(RDCOMClient)
library(stringr)

# Start Internet Explorer through COM (Windows only) and load the page
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate("http://liveindex.org/")
Sys.sleep(5)  # give the page time to load and render
doc <- IEApp$Document()
Sys.sleep(5)
# Rendered text of the whole page, as a single string
inner_Text <- doc$documentElement()$innerText()

# Split into lines, drop very long lines and lines that hold only whitespace
inner_Text_Splitted <- strsplit(inner_Text, "\n")[[1]]
inner_Text_Splitted <- inner_Text_Splitted[nchar(inner_Text_Splitted) < 1000]
inner_Text_Splitted <- inner_Text_Splitted[inner_Text_Splitted != "\r"]
inner_Text_Splitted <- inner_Text_Splitted[inner_Text_Splitted != " \r"]
inner_Text_Splitted <- inner_Text_Splitted[inner_Text_Splitted != "   \r"]

# More cleaning required but the information of the table is in the variable inner_Text_Splitted

The variable inner_Text_Splitted still needs more cleaning, but the information is there. You can also get similar results with the R package RSelenium.
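As a rough idea of what that extra cleaning might look like, here is a minimal base R sketch. The input vector raw_lines is a made-up stand-in for inner_Text_Splitted (the real page text will differ), and clean_lines is a hypothetical helper name, not part of the answer above. It trims surrounding whitespace, including the trailing "\r", and drops entries that end up empty.

```r
# Hypothetical sample standing in for strsplit(inner_Text, "\n")[[1]];
# the values are invented for illustration only.
raw_lines <- c("EUR/USD\r", " \r", "1.1385\r", "   \r", "\r",
               "GBP/USD \r", "1.4420\r")

# Hypothetical helper: trim each entry, then keep only non-empty ones
clean_lines <- function(x) {
  x <- trimws(x)   # removes leading/trailing spaces, tabs, "\r" and "\n"
  x[nzchar(x)]     # drops entries that are empty after trimming
}

cleaned <- clean_lines(raw_lines)
print(cleaned)
```

Unlike the three separate comparisons against "\r", " \r", and "   \r", this approach does not depend on the exact number of spaces in a blank line, and it also strips the "\r" left at the end of lines that do carry data.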

【Comments】: