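【Posted】: 2021-06-22 19:45:30
【Question】:
I want to load the table at the bottom of the following web page into R as a data frame or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function from the XML package: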
library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table
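However, this returns an empty list and gives me the following warning:
Warning message: XML content does not seem to be XML: ''
I also tried adapting the code found in "Scraping html tables into R data frames using the XML package". That works for 5 of the 6 tables on the page, but for the 6th table, which is the one I am interested in, it returns only the header row plus a single row whose values repeat the header row. The code is as follows: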
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those of the header row
head(applicanttable)
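To narrow this down, one check is whether the rows are missing from the downloaded markup itself or are being lost during parsing. The sketch below compares the number of rows readHTMLTable recovers with the number of <tr> elements present in each table. It runs on a small inline HTML string that is purely a hypothetical stand-in for the page (the ids "stats" and "applicants" and the cell values are made up for illustration); replacing `html` with the string returned by getURL above would apply the same check to the real page.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (NOT the real site's markup):
# one populated table, and one table whose only body row repeats the header.
html <- '<html><body>
<table id="stats"><tr><th>A</th><th>B</th></tr>
<tr><td>a1</td><td>b1</td></tr><tr><td>a2</td><td>b2</td></tr></table>
<table id="applicants"><tr><th>A</th><th>B</th></tr>
<tr><td>A</td><td>B</td></tr></table>
</body></html>'

# Parse the string once so the same document can be inspected in two ways
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Rows recovered per table; NROW() gives 0 for tables that come back as NULL
row_counts <- vapply(tables, NROW, integer(1))
print(row_counts)

# <tr> elements actually present in the markup of each table
tr_counts <- xpathSApply(doc, "//table",
                         function(node) length(getNodeSet(node, ".//tr")))
print(tr_counts)
```

If the 6th table on the real page also contains only two <tr> elements, the remaining rows are not in the HTML that getURL downloads, and no change to the readHTMLTable call would recover them. Whether that is the case for this site is an assumption I have not been able to confirm.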
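Any insight would be much appreciated! For reference, I have also looked at the following posts, which seem to have similar goals, but I could not find a solution there:
Scraping html tables into R data frames using the XML package
Extracting html table from a website in R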
【Comments】:
Tags: html r xml web-scraping