R readHTMLTable function not working: answers

【Question title】: R readHTMLTable function not working
【Posted】: 2016-08-06 08:21:10
【Question】:

I have the following R code, and I want to get some names from this particular webpage.

library(RCurl)
library(XML)
x <- getURL("http://www.encyclopedia-titanica.org/titanic-passengers-crew-lived/country-17/england.html")
x_2 <- htmlParse(x)
x_3 <- readHTMLTable(x_2)

However, whenever I inspect the contents of x_3, I get the following:

x_3
named list()

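It seems that the readHTMLTable function cannot find the table. Can anyone help me get the passengers' names from this webpage without copying and pasting? Much appreciated.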

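【Question comments】:

You need to extract the table element first, before you can use readHTMLTable(). Use XPath, something like tableVar <- xpathApply(x_2, "//table[@id='manifest']"). Then you should be able to do x_3 <- readHTMLTable(tableVar).
(By the way, I'm having firewall problems at the moment, so I can't test this...)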
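
The XPath idea from the comment above can be tried offline. The following is a minimal sketch, not a tested fix for the live page: the inline HTML, the placeholder passenger rows, and the variable names are all hypothetical, and the id 'manifest' is taken from the comment rather than verified against the site. It selects a single table node with getNodeSet() and passes that one node (not the whole list) to readHTMLTable().

```r
library(XML)

# Hypothetical stand-in for the downloaded page source (not the real site).
page_source <- '<html><body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="manifest">
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Passenger A</td><td>30</td></tr>
<tr><td>Passenger B</td><td>41</td></tr>
</table>
</body></html>'

# Parse the string as HTML rather than treating it as a file name or URL.
doc <- htmlParse(page_source, asText = TRUE)

# Select only the table whose id attribute is 'manifest' (assumed id).
table_nodes <- getNodeSet(doc, "//table[@id='manifest']")

# Convert the first matching node to a data frame; the first row holds the headers.
manifest <- readHTMLTable(table_nodes[[1]], header = TRUE,
                          stringsAsFactors = FALSE)

# Pull out the column of names.
passenger_names <- manifest$Name
```

If getNodeSet() returns an empty list on the real page, the table is probably not present in the HTML that getURL() returned, and the XPath or the download step would need to be checked first.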

Tags: r xml-parsing web-scraping

【Solution 1】:

library(rvest)
library(dplyr)

base <- "http://www.encyclopedia-titanica.org/titanic-passengers-crew-lived/country-17/england.html"

# I use the older rvest package; in current releases `html` has been replaced by `read_html`. Link to the git repo:
# https://github.com/hadley/rvest/blob/7d65d84e013b1bb3827ae0a2e05ddaed4875c112/R/parse.R
data_df <- (html(base) %>% html_table)[[1]]

knitr::kable(summary(data_df))

    |   |    Name         |    Age          | Class/Dept      |   Ticket        |   Joined        |    Job          |Boat [Body]      |             |
    |:--|:----------------|:----------------|:----------------|:----------------|:----------------|:----------------|:----------------|:------------|
    |   |Length:1190      |Length:1190      |Length:1190      |Length:1190      |Length:1190      |Length:1190      |Length:1190      |Mode:logical |
    |   |Class :character |Class :character |Class :character |Class :character |Class :character |Class :character |Class :character |NA's:1190    |
    |   |Mode  :character |Mode  :character |Mode  :character |Mode  :character |Mode  :character |Mode  :character |Mode  :character |NA           |
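
For readers on a current rvest release, the same approach can be sketched with read_html() in place of html(). This is an illustrative example under stated assumptions: the inline HTML and the placeholder rows are hypothetical and stand in for the real page, so the code runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the page; on the real site you would pass the URL instead.
page_source <- '<html><body>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Passenger A</td><td>30</td></tr>
<tr><td>Passenger B</td><td>41</td></tr>
</table>
</body></html>'

# read_html() accepts a URL, a file path, or a string containing HTML.
page <- read_html(page_source)

# html_table() on a whole document returns one table per <table> element;
# take the first, as in the answer above.
data_df <- html_table(page)[[1]]

# Extract the names as a character vector.
passenger_names <- as.character(data_df$Name)
```

With the real URL, it is worth checking length(html_table(page)) first, since the index [[1]] assumes the passenger table is the first table on the page.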

【Comments】:

Thank you very much for this solution. It works great!
Glad to hear it, @ACE