【Question Title】: Using R to scrape a table and links from a web page
【Posted】: 2021-05-03 13:24:25
【Question】:

I am trying to scrape a website with R. I need the table on the page plus the links from that table, associated with the correct rows. I can get the table and I can get the links, but the web table has two columns containing links, some rows have no links at all, and the links cannot simply be sorted and joined by file name. I don't know how to build a data frame whose columns hold the links matched to the correct rows.

library(rvest)

# Read the HTML of the EPA data page
content <- read_html("https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys")

# Parse all tables on the page; the first is the one of interest
tables <- content %>%
          html_table(fill = TRUE)
EPA_table <- tables[[1]]

# Get every link in the table -- this loses the row association
web <- content %>%
    html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
    html_attr("href")

【Comments】:

    Tags: r dataframe web-scraping hyperlink data-cleaning


    【Solution 1】:

    Use the xpath= argument to select individual columns: //td[3] picks the third cell (the Data column) of every table row, //td[4] the fourth (Metadata).

    ## Data links (third column)
    web <- content %>%
      html_nodes("table tr") %>%
      html_nodes(xpath="//td[3]") %>%  ## third cell of every row
      html_nodes("a") %>%
      html_attr("href")
    
    EPA_table$web1 <- web  ## add Data links column
    
    ## Metadata links (fourth column) accordingly
    web2 <- content %>%
      html_nodes("table tr") %>%
      html_nodes(xpath="//td[4]") %>%  ## fourth cell of every row
      html_nodes("a") %>%
      html_attr("href")
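    One note on the xpath: because the expression starts with //, it searches the whole document rather than only the previously selected tr nodes, which works here because the page has a single data table. A minimal offline sketch of the column-selection technique, using a hypothetical two-row table that mimics the EPA page's layout:

    ```r
    library(rvest)

    # Hypothetical table: column 3 holds Data links, column 4 Metadata links
    doc <- minimal_html('
      <table><tbody>
        <tr><td>Lakes 2007</td><td>All</td><td><a href="/a.zip">All Data</a></td><td></td></tr>
        <tr><td>Lakes 2007</td><td>Profile</td><td><a href="/b.csv">Profile</a></td><td><a href="/b.txt">Meta</a></td></tr>
      </tbody></table>')

    # //td[3] matches the third cell of every row; the <a> inside carries the link
    doc %>%
      html_nodes(xpath = "//td[3]") %>%
      html_nodes("a") %>%
      html_attr("href")
    # c("/a.zip", "/b.csv")
    ```

    Rows whose third cell has no anchor simply contribute nothing to the result, which is why the lengths of web and web2 differ from the number of table rows.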
    

    Empty Metadata cells can be set to NA; the metadata links then fit exactly into the rows where the value is not NA.

    EPA_table[EPA_table$Metadata %in% "", "Metadata"] <- NA
    EPA_table[!is.na(EPA_table$Metadata), "web2"] <- web2  ## add metadata column
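    The scraped hrefs are site-relative. If full URLs are wanted, xml2's url_absolute() can resolve them against the page address; a small sketch using one of the paths above:

    ```r
    library(xml2)

    base <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
    rel  <- "/sites/production/files/2013-09/nla2007_profile_20091008.csv"

    # Resolve the site-relative path against the page URL
    url_absolute(rel, base)
    # "https://www.epa.gov/sites/production/files/2013-09/nla2007_profile_20091008.csv"
    ```

    url_absolute() is vectorised over its first argument, so it can be applied to the whole web1 / web2 columns in one call.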
    

    Result:

    head(EPA_table)
    #       Survey         Indicator
    # 1 Lakes 2007               All
    # 2 Lakes 2007    Landscape Data
    # 3 Lakes 2007   Water Chemistry
    # 4 Lakes 2007 Visual Assessment
    # 5 Lakes 2007  Site Information
    # 6 Lakes 2007             Notes
    #                                                                 Data
    # 1                               NLA 2007 All Data (ZIP)(1 pg, 5 MB)
    # 2 NLA 2007 Basin Landuse Metrics - Data 20061022 (CSV)(1 pg, 307 K)
    # 3               NLA 2007 Profile - Data 20091008 (CSV)(1 pg, 888 K)
    # 4     NLA 2007 Visual Assessment - Data 20091015 (CSV)(1 pg, 813 K)
    # 5      NLA 2007 Site Information - Data 20091113 (CSV)(1 pg, 980 K)
    # 6                   National Lakes Assessment 2007 Final Data Notes
    #                                                              Metadata
    # 1                                                                <NA>
    # 2 NLA 2007 Basin Landuse Metrics - Metadata 20091022 (TXT)(1 pg, 4 K)
    # 3             NLA 2007 Profile - Metadata 20091008 (TXT)(1 pg, 650 B)
    # 4     NLA 2007 Visual Assessment - Metadata 10091015 (TXT)(1 pg, 7 K)
    # 5      NLA 2007 Site Information - Metadata 20091113 (TXT)(1 pg, 8 K)
    # 6                                                                <NA>
    #                                                                                 web1
    # 1                                /sites/production/files/2017-02/nla2007_alldata.zip
    # 2         /sites/production/files/2013-09/nla2007_basin_landuse_metrics_20061022.csv
    # 3                       /sites/production/files/2013-09/nla2007_profile_20091008.csv
    # 4              /sites/production/files/2014-01/nla2007_visualassessment_20091015.csv
    # 5        /sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv
    # 6 /national-aquatic-resource-surveys/national-lakes-assessment-2007-final-data-notes
    #                                                                               web2
    # 1                                                                             <NA>
    # 2  /sites/production/files/2013-09/nla2007_basin_landuse_metrics_info_20091022.txt
    # 3              /sites/production/files/2013-09/nla2007_profile_info_20091008_0.txt
    # 4       /sites/production/files/2014-01/nla2007_visualassessment_info_20091015.txt
    # 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_info_20091113.txt
    # 6                                                                             <NA>
    

    【Comments】:

      【Solution 2】:

      I would use CSS selectors with :nth-child to separate out the individual columns within a loop over the table rows. By using tbody in the selector I exclude the header row and process only the body rows, passing that list to map_df.

      library(rvest)
      library(purrr)
      library(dplyr)  # for if_else()
      
      url <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
      rows <- read_html(url) %>% html_nodes("#narsdata tbody tr")
      
      # One data.frame row per table row; html_node() returns NA for missing cells,
      # so rows without links stay aligned
      df <- map_df(rows, function(x) {
        data.frame(
          Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
          Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
          Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
          Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
          stringsAsFactors = FALSE
        )
      })
      

      I don't think you really need the file names in addition to the urls, but if you do, you can extend the data.frame with two extra columns and extract html_text instead of html_attr, e.g.

      Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
      Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text()
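      The html_text() / html_attr() distinction is easy to check offline with rvest::minimal_html() and a hypothetical cell like the ones in the EPA table:

      ```r
      library(rvest)

      # Hypothetical table cell: label text plus a file link
      cell <- minimal_html('<table><tr><td><a href="/files/data.csv">Data (CSV)</a></td></tr></table>') %>%
        html_node("a")

      html_text(cell)          # the anchor's visible label: "Data (CSV)"
      html_attr(cell, "href")  # the link target: "/files/data.csv"
      ```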
      

      【Comments】:
