Answers to: Issue loading HTML Table into R

【Question title】: Issue loading HTML Table into R
【Posted】: 2021-06-22 19:45:30
【Question description】:

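I want to load the table at the bottom of the following web page into R as a data frame or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function from the XML package: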

library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table

However, this returns an empty list and gives me the following warning:

Warning message:
XML content does not seem to be XML: ''

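I also tried adapting the code found in "Scraping html tables into R data frames using the XML package". This works for 5 of the 6 tables on the page, but for the 6th table, which is the one I am interested in, it returns only the header row plus a single row of values that repeats the header row. The code is as follows: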

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those in the header row
head(applicanttable)

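Any insight would be greatly appreciated! For reference, I have also consulted the following posts, which seem to have similar goals, but I could not find a solution there: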

Scraping html tables into R data frames using the XML package
Extracting html table from a website in R

【Question discussion】:

Tags: html r xml web-scraping

【Solution 1】:

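The table data is pulled dynamically from a nested JavaScript array, located inside a script tag, when JavaScript runs in the browser. That step does not happen when you use rvest, which retrieves the non-rendered content (what you see in view-source), so the table rows are missing from the parsed HTML.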

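You can use a regex to pull out the relevant nested array, then rebuild the table by splitting out the rows, adding the appropriate headers, and performing some data manipulation on individual columns. For example, some columns contain html that needs to be parsed to obtain the desired value.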
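The extract-and-rebuild steps described above can be sketched in base R on a small stand-in string. The script text, the column names, and the cell values below are hypothetical and much simpler than the real page; the point is only to show the shape of the approach (extract the array, split rows, split fields, attach headers, clean columns).

```r
# Hypothetical stand-in for the page's script content; the real page differs.
script_text <- paste(
  "$('#applicants-table').DataTable({",
  "  data: [",
  "['<a href=\"/u/1\">alice</a>', '3.9', '172', 'Accepted'],",
  "['bob', '3.7', '168', 'Rejected<br>2018-03-01']",
  "]",
  "});",
  sep = "\n"
)
headers <- c("Name", "GPA", "LSAT", "Result")  # hypothetical column names

# 1. Pull out the nested array "data: [ [...], [...] ]" ((?s) lets . match newlines).
array_text <- regmatches(
  script_text,
  regexpr("(?s)data:\\s*\\[.*?\\]\\s*\\]", script_text, perl = TRUE)
)

# 2. Split into rows: every [...] that contains no further brackets is one row.
rows <- regmatches(
  array_text,
  gregexpr("\\[[^\\[\\]]*\\]", array_text, perl = TRUE)
)[[1]]

# 3. Split each row into its single-quoted fields and drop the quotes.
fields <- lapply(rows, function(row) {
  quoted <- regmatches(row, gregexpr("'[^']*'", row))[[1]]
  gsub("^'|'$", "", quoted)
})

# 4. Stack the rows into a data frame and attach the headers.
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- headers

# 5. Per-column clean-up: strip tags, drop trailing html, convert types.
df$Name   <- gsub("<[^>]+>", "", df$Name)
df$Result <- sub("<.*", "", df$Result)
df$GPA    <- as.numeric(df$GPA)
df$LSAT   <- as.integer(df$LSAT)

df
```

This sketch assumes that no field contains a single quote or a square bracket; real data that breaks either assumption would need a more careful parser than these regexes.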

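Because some columns, e.g. Name, contain values that read_html could interpret as file paths, I run them through htmltidy first to make sure they are handled as valid html.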

Note: if you use RSelenium, the page will be rendered and you can grab the table directly without having to rebuild it.
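For the RSelenium route, a rough sketch might look like the following. The function name scrape_rendered_table, the choice of Firefox, and the fixed two-second wait are my assumptions, not something taken from the page or tested against it; it also needs a local browser and driver set up for RSelenium.

```r
# Hypothetical helper: render the page in a real browser, then parse the table.
scrape_rendered_table <- function(url, css = "#applicants-table") {
  stopifnot(
    requireNamespace("RSelenium", quietly = TRUE),
    requireNamespace("rvest", quietly = TRUE)
  )

  # Start a Selenium server plus browser session (Firefox is an assumption).
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  on.exit(driver$server$stop(), add = TRUE)
  client <- driver$client
  on.exit(client$close(), add = TRUE, after = FALSE)

  client$navigate(url)
  Sys.sleep(2)  # crude wait for the JavaScript to fill the table

  # Hand the rendered source to rvest and read the table as usual.
  html <- client$getPageSource()[[1]]
  page <- rvest::read_html(html)
  rvest::html_table(rvest::html_element(page, css))
}

# Usage (not run here, since it launches a browser):
# applicants <- scrape_rendered_table(
#   "https://www.lawschooldata.org/school/Yale%20University/18"
# )
```

One caveat to check: if the rendered table is paginated on the client side, the page source may only contain the rows currently shown, in which case you would have to page through it or fall back to the array-parsing approach.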

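To do:

You may still want to apply some data type conversions to a few columns.
More logic is needed to make sure that only the name is returned in the Name column. Take df$Name[10] as an example: it returns "Character and fitness issues" instead of Anxiousboy, because the required value actually sits in element.nextSibling.nextSibling of the p tag that gets selected. These infrequent edge cases need some additional logic built in. In this case, you might test for a particular string being returned and then re-parse with an xpath expression.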
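As a sketch of that last idea, the snippet below tests for a known flag string and, if found, re-parses with an xpath expression that takes the first non-blank text node after the p tag. The html fragment and the known_flags vector are hypothetical; I have not checked the exact markup of the real cell, only that the value sits two siblings after the p.

```r
library(xml2)

# Hypothetical cell content: the wanted name is a bare text node after the <p>.
cell_html <- "<div><p>Character and fitness issues</p><br/>Anxiousboy</div>"
cell <- read_html(cell_html)

# What a plain "first p" selection returns.
first_p <- xml_find_first(cell, "//p")
value <- xml_text(first_p, trim = TRUE)

# Hypothetical list of strings that signal the edge case.
known_flags <- c("Character and fitness issues")

if (value %in% known_flags) {
  # Re-parse: first non-blank text node among the siblings following the <p>.
  name_node <- xml_find_first(
    first_p,
    "following-sibling::text()[normalize-space()][1]"
  )
  value <- xml_text(name_node, trim = TRUE)
}

value
```

With the fragment above, value should end up as "Anxiousboy". In get_value this check would go after the html_text call, and the list of flag strings would have to be collected from the actual data.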

R:

library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

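# Parse a cell that may contain html and return the text of the first
# a, p or span element; fall back to the raw input if none is found.
get_value <- function(input) {
  value <- tidy_html(input) %>%
    read_html() %>%
    html_node("a, p, span") %>%
    html_text(trim = TRUE)
  result <- ifelse(is.na(value), input, value)
  return(result)
}

# Keep only the text before the first html tag
tidy_result <- function(result) {
  return(gsub("<.*", "", result))
}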

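# Fetch the static (non-rendered) page source
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")

s <- page %>% toString()

# Read the column headers from the table's th elements
headers <- page %>%
  html_nodes("#applicants-table th") %>%
  html_text(trim = TRUE)

# Extract the nested JavaScript array passed to DataTable(), then drop newlines
s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = TRUE)) %>%
  gsub("\n", "", .)

# Lazily match each [...] chunk; each chunk holds one table row
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = TRUE))[[1]] %>% as.list()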

df <- sapply(rows, function(x) {
  stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
  t() %>%
  as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...

names(df) <- headers

# Parse the html-bearing columns row by row, then trim trailing html from Result
df <- df %>%
  rowwise() %>%
  mutate(across(c("Name", "GRE", "URM", "$$$$"), .fns = get_value)) %>%
  mutate_at(c("Result"), tidy_result)

write.csv(df, "Yale Applications.csv")

Created on 2021-06-23 by the reprex package (v0.3.0)

Sample output: (image not included)

【Discussion】:

This is exactly what I was looking for. Thank you so much!