【Question Title】: Using R to scrape a table and links from a web page
【Posted】: 2021-05-03 13:24:25
【Question】:

I am trying to scrape a website with R. I need the table on the page plus the links from that table, associated with the correct rows. I can get the table and I can get the links, but the web table has two columns containing links, some rows have no links at all, and the links cannot simply be sorted and joined by file name. I don't know how to build a data frame whose columns hold the links matched to the correct rows.

library(rvest)

# Read the HTML of the EPA data page
content <- read_html("https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys")

# Parse all tables on the page; the first is the one of interest
tables <- content %>%
          html_table(fill = TRUE)
EPA_table <- tables[[1]]

# Get every link in the table -- this loses the row association
web <- content %>%
    html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
    html_attr("href")

【Comments】:

    Tags: r dataframe web-scraping hyperlink data-cleaning


    【Solution 1】:

    Use the xpath= argument to select individual columns: //td[3] picks the third cell (the Data column) of every table row, //td[4] the fourth (Metadata).

    ## Data links (third column)
    web <- content %>%
      html_nodes("table tr") %>%
      html_nodes(xpath="//td[3]") %>%  ## third cell of every row
      html_nodes("a") %>%
      html_attr("href")
    
    EPA_table$web1 <- web  ## add Data links column
    
    ## Metadata links (fourth column) accordingly
    web2 <- content %>%
      html_nodes("table tr") %>%
      html_nodes(xpath="//td[4]") %>%  ## fourth cell of every row
      html_nodes("a") %>%
      html_attr("href")
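    One note on the xpath: because the expression starts with //, it searches the whole document rather than only the previously selected tr nodes, which works here because the page has a single data table. A minimal offline sketch of the column-selection technique, using a hypothetical two-row table that mimics the EPA page's layout:

    ```r
    library(rvest)

    # Hypothetical table: column 3 holds Data links, column 4 Metadata links
    doc <- minimal_html('
      <table><tbody>
        <tr><td>Lakes 2007</td><td>All</td><td><a href="/a.zip">All Data</a></td><td></td></tr>
        <tr><td>Lakes 2007</td><td>Profile</td><td><a href="/b.csv">Profile</a></td><td><a href="/b.txt">Meta</a></td></tr>
      </tbody></table>')

    # //td[3] matches the third cell of every row; the <a> inside carries the link
    doc %>%
      html_nodes(xpath = "//td[3]") %>%
      html_nodes("a") %>%
      html_attr("href")
    # c("/a.zip", "/b.csv")
    ```

    Rows whose third cell has no anchor simply contribute nothing to the result, which is why the lengths of web and web2 differ from the number of table rows.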
    

    Empty Metadata cells can be set to NA; the metadata links then fit exactly into the rows where the value is not NA.

    EPA_table[EPA_table$Metadata %in% "", "Metadata"] <- NA
    EPA_table[!is.na(EPA_table$Metadata), "web2"] <- web2  ## add metadata column
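    The scraped hrefs are site-relative. If full URLs are wanted, xml2's url_absolute() can resolve them against the page address; a small sketch using one of the paths above:

    ```r
    library(xml2)

    base <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
    rel  <- "/sites/production/files/2013-09/nla2007_profile_20091008.csv"

    # Resolve the site-relative path against the page URL
    url_absolute(rel, base)
    # "https://www.epa.gov/sites/production/files/2013-09/nla2007_profile_20091008.csv"
    ```

    url_absolute() is vectorised over its first argument, so it can be applied to the whole web1 / web2 columns in one call.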
    

    Result:

    head(EPA_table)
    #       Survey         Indicator
    # 1 Lakes 2007               All
    # 2 Lakes 2007    Landscape Data
    # 3 Lakes 2007   Water Chemistry
    # 4 Lakes 2007 Visual Assessment
    # 5 Lakes 2007  Site Information
    # 6 Lakes 2007             Notes
    #                                                                 Data
    # 1                               NLA 2007 All Data (ZIP)(1 pg, 5 MB)
    # 2 NLA 2007 Basin Landuse Metrics - Data 20061022 (CSV)(1 pg, 307 K)
    # 3               NLA 2007 Profile - Data 20091008 (CSV)(1 pg, 888 K)
    # 4     NLA 2007 Visual Assessment - Data 20091015 (CSV)(1 pg, 813 K)
    # 5      NLA 2007 Site Information - Data 20091113 (CSV)(1 pg, 980 K)
    # 6                   National Lakes Assessment 2007 Final Data Notes
    #                                                              Metadata
    # 1                                                                <NA>
    # 2 NLA 2007 Basin Landuse Metrics - Metadata 20091022 (TXT)(1 pg, 4 K)
    # 3             NLA 2007 Profile - Metadata 20091008 (TXT)(1 pg, 650 B)
    # 4     NLA 2007 Visual Assessment - Metadata 10091015 (TXT)(1 pg, 7 K)
    # 5      NLA 2007 Site Information - Metadata 20091113 (TXT)(1 pg, 8 K)
    # 6                                                                <NA>
    #                                                                                 web1
    # 1                                /sites/production/files/2017-02/nla2007_alldata.zip
    # 2         /sites/production/files/2013-09/nla2007_basin_landuse_metrics_20061022.csv
    # 3                       /sites/production/files/2013-09/nla2007_profile_20091008.csv
    # 4              /sites/production/files/2014-01/nla2007_visualassessment_20091015.csv
    # 5        /sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv
    # 6 /national-aquatic-resource-surveys/national-lakes-assessment-2007-final-data-notes
    #                                                                               web2
    # 1                                                                             <NA>
    # 2  /sites/production/files/2013-09/nla2007_basin_landuse_metrics_info_20091022.txt
    # 3              /sites/production/files/2013-09/nla2007_profile_info_20091008_0.txt
    # 4       /sites/production/files/2014-01/nla2007_visualassessment_info_20091015.txt
    # 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_info_20091113.txt
    # 6                                                                             <NA>
    

    【Comments】:

      【Solution 2】:

      I would use CSS selectors with :nth-child to separate out the individual columns within a loop over the table rows. By using tbody in the selector I exclude the header row and process only the body rows, passing that list to map_df.

      library(rvest)
      library(purrr)
      library(dplyr)  # for if_else()
      
      url <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
      rows <- read_html(url) %>% html_nodes("#narsdata tbody tr")
      
      # One data.frame row per table row; html_node() returns NA for missing cells,
      # so rows without links stay aligned
      df <- map_df(rows, function(x) {
        data.frame(
          Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
          Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
          Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
          Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
          stringsAsFactors = FALSE
        )
      })
      

      I don't think you really need the file names in addition to the urls, but if you do, you can extend the data.frame with two extra columns and extract html_text instead of html_attr, e.g.

      Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
      Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text()
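      The html_text() / html_attr() distinction is easy to check offline with rvest::minimal_html() and a hypothetical cell like the ones in the EPA table:

      ```r
      library(rvest)

      # Hypothetical table cell: label text plus a file link
      cell <- minimal_html('<table><tr><td><a href="/files/data.csv">Data (CSV)</a></td></tr></table>') %>%
        html_node("a")

      html_text(cell)          # the anchor's visible label: "Data (CSV)"
      html_attr(cell, "href")  # the link target: "/files/data.csv"
      ```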
      

      【Comments】:
