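Scraping tabulated data from ballotpedia.org with rvest: answers

【Question title】: Scraping tabulated data from ballotpedia.org with rvest
【Posted】: 2018-07-31 21:43:56
【Question】:

I am trying to collect tabulated results from past US statewide elections, and ballotpedia.org looks like a good source because the URL format is consistent across all states.

Here is the code I set up to test it: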

library(dplyr)
library(rvest)

# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name) 
senate_year_urls <- c(",_2012", ",_2014", ",_2016")

# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])

This produces the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
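For reference, the same components can be combined into every state/year URL at once. This is only a minimal sketch: `grid` and `all_urls` are names introduced here for illustration, and since not every state holds a Senate election in every listed year, some of the generated URLs may not correspond to an existing page.

```r
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")

# Every state/year combination; expand.grid varies the first factor (state) fastest
grid <- expand.grid(state = senate_state_urls, year = senate_year_urls,
                    stringsAsFactors = FALSE)
all_urls <- paste0(senate_base_url, grid$state, grid$year)

length(all_urls)
head(all_urls, 3)
```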

Using the SelectorGadget Chrome extension, I selected the table containing the election results and tried to parse it into R as follows:

test_data <- read_html(test_url)
test_data <- test_data %>% 
  html_node(xpath = '//*[@id="collapsibleTable0"]') %>% 
  html_table()

However, I get the following error:

Error in UseMethod("html_table") : 
  no applicable method for 'html_table' applied to an object of class "xml_missing"

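In addition, the R object test_data ends up as a list of 2 empty elements.

Can anyone tell me what I am doing wrong here? Is html_table() the wrong function? Using html_text() only returns an NA character vector. Any help would be much appreciated, many thanks :).

【Discussion】:

Tags: r web-scraping rvest

【Solution 1】:

Your XPath expression does not match any node in the HTML that read_html() downloads, so html_node() returns a missing node (class "xml_missing"), which html_table() cannot handle.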

Here is a solution that selects by HTML tag instead: "look for table tags inside center tags".

library(rvest) 

test_data <- read_html(test_url)
test_data <- test_data %>% html_nodes("center table") %>% html_table()

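Or, to retrieve the full collapsible table, select the table tag by its class name:

# test_data was overwritten above, so parse the page again
collapsedtable <- read_html(test_url) %>% 
        html_nodes("table.collapsible") %>% 
        html_table(fill = TRUE)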
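To illustrate why the original call failed, here is a minimal sketch on a made-up HTML snippet. The markup below is hypothetical and not copied from Ballotpedia. A likely cause on the real page (an assumption, not verified here) is that the id "collapsibleTable0" is added by JavaScript in the browser and so is absent from the HTML that R downloads. When nothing matches, html_node() returns an "xml_missing" object, whereas html_nodes() returns an empty node set whose length can be checked before parsing.

```r
library(rvest)

# Hypothetical markup for illustration only; not the real Ballotpedia page
html <- '<html><body><center>
  <table class="collapsible">
    <tr><th>Candidate</th><th>Votes</th></tr>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </table>
</center></body></html>'
page <- read_html(html)

# No element carries this id in the parsed HTML, so the result is a missing node
missing_node <- page %>% html_node(xpath = '//*[@id="collapsibleTable0"]')
class(missing_node)

# html_nodes() returns an empty node set instead, so matches can be counted
n_by_id <- length(page %>% html_nodes(xpath = '//*[@id="collapsibleTable0"]'))
n_by_class <- length(page %>% html_nodes("table.collapsible"))

# html_table() on a node set returns a list with one data frame per table
results <- page %>% html_nodes("center table") %>% html_table()
results[[1]]
```

Checking the match count first makes it clear whether the selector or the table parsing is at fault.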

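【Discussion】:

Thanks, that works. So how did you find the right HTML node, i.e. "center table" rather than '//*[@id="collapsibleTable0"]'?
I use the developer tools in the browser and inspect the source. It is slower than SelectorGadget, but I have had more success understanding the structure that way.

【Solution 2】:

This worked for me: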
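A similar inspection can also be done from R, on the HTML that read_html() actually receives. The following is a minimal sketch on hypothetical markup (not the real Ballotpedia page); `table_attrs` is a name introduced here for illustration. It lists the id and class of every table so that a selector can be chosen from attributes that are really present.

```r
library(rvest)

# Hypothetical markup for illustration only; not the real Ballotpedia page
page <- read_html('<html><body>
  <table class="infobox"><tr><td>x</td></tr></table>
  <center><table class="collapsible"><tr><td>y</td></tr></table></center>
</body></html>')

tables <- page %>% html_nodes("table")

# One row per <table>; html_attr() gives NA where an attribute is absent
table_attrs <- data.frame(
  id = html_attr(tables, "id"),
  class = html_attr(tables, "class"),
  stringsAsFactors = FALSE
)
table_attrs
```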

library(httr)
library(XML)

r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]

【Discussion】: