Trouble scraping table from Wikipedia

【Question title】: Trouble scraping table from Wikipedia
【Posted】: 2015-11-27 08:34:55
【Question】:

I'm having trouble following the selected answer to this question. The table I want to scrape is this list of U.S. state populations.

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

This is the error I get:

Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

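What gives?

(Note: although I'm looking for a way to fix this error, I'd also appreciate it if you could point me to a simpler way of getting the population data.)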

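【Comments】:

Wikipedia lets you download their entire database for free... en.wikipedia.org/wiki/Wikipedia:Database_download That should ease the strain on their already exhausted web servers.
Err, you could follow the reference link for the relevant data at the bottom of the page, go to the reference site, a.k.a. the Census, and download the csv or xls it offers.
@ScottMcGready, you must have a big external HD. :) The download you suggest is not small, just for a table of 50 rows with a few columns of interest.
@ShawnMehan Maybe...
Also, in my experience the Simple English Wikipedia is usually easier to scrape: simple.wikipedia.org/wiki/List_of_U.S._states_by_population

Tags: r xml web-scraping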

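【Solution 1】:

There is nothing wrong with your code. There is, however, a problem with your URL.

You can test this from a shell, checking that the external input to your code is not what makes it fail, e.g.: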

curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

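This returns an empty body, much like what your R code receives. That should convince you that your R code is not the problem. Having found that, you might go to the section of the page you are interested in and, again using curl as a free and simple test environment, run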

curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population#States_and_territories

which definitely does not return an empty result:

...
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-List_of_U_S_states_and_territories_by_population skin-vector action-view">
    <div id="mw-page-base" class="noprint"></div>
    <div id="mw-head-base" class="noprint"></div>
    <div id="content" class="mw-body" role="main">
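A minimal sketch of one way to act on this, assuming the failure comes from the page being served over https, which the parser behind readHTMLTable() may not be able to fetch by itself. The idea is to download the HTML separately and hand the text to the XML package. The helper names parse_tables and fetch_tables are hypothetical, httr is an assumed extra dependency, and the inline sample page (with made-up placeholder values) exists only so the parsing step can run without network access.

```r
library(XML)

# Parse every <table> found in HTML text that has already been downloaded.
# parse_tables is a hypothetical helper name, not part of the XML package.
parse_tables <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Download a page over https with httr (assumed to be installed), then parse it.
# fetch_tables is likewise hypothetical; it is defined here but not called.
fetch_tables <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # fail loudly on HTTP errors
  parse_tables(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Made-up stand-in page so the parsing step runs offline.
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>State</th><th>Population</th></tr>",
  "<tr><td>Alpha</td><td>1,000</td></tr>",
  "<tr><td>Beta</td><td>2,500</td></tr>",
  "</table></body></html>"
)

tables <- parse_tables(sample_html)

# Same row count computation as in the question.
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
```

Against the live page, the call would be fetch_tables("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"), which needs network access.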

【Comments】:

【Solution 2】:

This is easy to do with rvest:

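library(rvest); library(magrittr) # magrittr for %>% and extract()

# Same page as in the question; html() was deprecated in later rvest releases in favor of read_html()
theurl <- "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

theurl %>%
  read_html() %>%
  html_nodes("table") %>% extract(1) %>%
  html_table(fill=TRUE) %>% extract(1) -> pop_table # one-element list; the data frame is pop_table[[1]]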

See @Cory's blog for more information.
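As a follow-up sketch, the steps that usually come after the scrape are choosing which table to keep and turning the population column into numbers. The example below runs on a made-up inline page rather than the live article. The column names, the placeholder values, and the rule of keeping the table with the most rows are all assumptions for illustration. The real page may need a different selection rule.

```r
library(rvest)

# Made-up stand-in page with two tables; names and values are placeholders.
sample_html <- paste0(
  "<html><body>",
  "<table><tr><th>Note</th></tr><tr><td>navigation</td></tr></table>",
  "<table>",
  "<tr><th>State</th><th>Population</th></tr>",
  "<tr><td>Alpha</td><td>1,000</td></tr>",
  "<tr><td>Beta</td><td>2,500</td></tr>",
  "</table>",
  "</body></html>"
)

page <- read_html(sample_html)

# html_table() on a set of table nodes returns a list, one element per table.
tables <- html_table(html_nodes(page, "table"))

# Keep the table with the most rows (a heuristic, assumed to be the data table).
n_rows <- vapply(tables, nrow, integer(1))
pop_table <- as.data.frame(tables[[which.max(n_rows)]])

# Strip thousands separators so the column can be treated as numeric.
pop_table$Population <- as.numeric(gsub(",", "", as.character(pop_table$Population)))
```

For the live page, read_html() can be given the article URL in place of sample_html.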

【Comments】: