【Question title】: Scrape nested html structure
【Posted】: 2021-06-05 12:15:47
【Question】:

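I would like to scrape data from this site (the URL is in the code below) without losing the information held in the nested structure. Consider the name benodanil: it belongs not only to the benzanilide fungicides, but also to the anilide fungicides and the amide fungicides. There are not always three classes; there is at least one, and there can be several. So, ideally, I would like a data.frame that looks like this: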

name         class1                  class2              class3            ...
benodanil    benzanilide fungicides  anilide fungicides  amide fungicides  NA
aureofungin  antibiotic fungicides   NA                  NA                NA
...          ...                     ...                 ...

I can scrape the data, but I do not know how to handle the information in the nested structure. What I have tried so far:

require(rvest)

url = 'http://www.alanwood.net/pesticides/class_fungicides.html'

site = read_html(url)
# extract list items
li = html_nodes(site, 'li')
# extract unordered lists
ul = html_nodes(site, 'ul')

# loop idea
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  
  l[[i]] = list(name = name,
                clas = clas)
}

Another problem is that some names occur several times, e.g. bixafen. So I suppose this job has to be done iteratively.

【Question comments】:

    Tags: html r web-scraping rvest


    【Solution 1】:
    library(dplyr)
    library(tidyr)
    library(rvest)
    
    url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
    
    site = read_html(url)
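    # All links inside the nested lists: class anchors carry a `name`
    # attribute, compound links carry an `href` attribute
    a <- site %>% html_nodes('li ul a')
    
    tibble(name = a %>% html_attr('href'), 
           class = a %>% html_attr('name')) %>%
      fill(class) %>%                              # carry each class down to the links below it
      filter(!is.na(name)) %>%                     # drop the class-anchor rows themselves
      mutate(name = sub('\\.html', '', name)) %>%  # strip the file extension
      group_by(name) %>%
      mutate(col = paste0('class', row_number())) %>%  # number the classes per name
      pivot_wider(names_from = col, values_from = class) %>%
      ungroup()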
    
    # A tibble: 189 x 4
    #   name         class1                  class2                class3                     
    #   <chr>        <chr>                   <chr>                 <chr>                      
    # 1 benalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
    # 2 benalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
    # 3 furalaxyl    acylamino_acid_fungici… furanilide_fungicides NA                         
    # 4 metalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
    # 5 metalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
    # 6 pefurazoate  acylamino_acid_fungici… NA                    NA                         
    # 7 valifenalate acylamino_acid_fungici… NA                    NA                         
    # 8 bixafen      anilide_fungicides      picolinamide_fungici… pyrazolecarboxamide_fungic…
    # 9 boscalid     anilide_fungicides      NA                    NA                         
    #10 carboxin     anilide_fungicides      NA                    NA                         
    # … with 179 more rows
    

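    This extracts name and class from the web page, uses fill() to replace each NA class with the last non-NA value above it, drops the rows where name is NA, and reshapes the data to wide format.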
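
To see each step without hitting the live site, the same pipeline can be run on a small inline document. This is only a sketch: the HTML below is a hypothetical miniature, assumed to mirror the page in that class anchors use a name attribute and compound links use an href attribute. The class pyrazole_fungicides and the object names html, a and wide are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(rvest)

# Hypothetical miniature of the page (an assumption, not the real markup):
# class anchors use `name`, compound links use `href`,
# and bixafen is listed under two classes.
html <- '
<ul><li>fungicides
  <ul>
    <li><a name="anilide_fungicides">anilide fungicides</a>
      <ul>
        <li><a href="bixafen.html">bixafen</a></li>
        <li><a href="boscalid.html">boscalid</a></li>
      </ul>
    </li>
    <li><a name="pyrazole_fungicides">pyrazole fungicides</a>
      <ul>
        <li><a href="bixafen.html">bixafen</a></li>
      </ul>
    </li>
  </ul>
</li></ul>'

# read_html() accepts a literal HTML string as well as a URL
a <- read_html(html) %>% html_nodes('li ul a')

wide <- tibble(name = a %>% html_attr('href'),
               class = a %>% html_attr('name')) %>%
  fill(class) %>%                              # class anchor value flows down to its links
  filter(!is.na(name)) %>%                     # keep only the compound links
  mutate(name = sub('\\.html', '', name)) %>%
  group_by(name) %>%
  mutate(col = paste0('class', row_number())) %>%  # class1, class2, ... per name
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()

print(wide)
```

In this toy input bixafen ends up with two class columns, while boscalid has a class only in class1 and NA in class2. Note that fill() carries down only the most recent class anchor, so a name picks up an additional class each time it is listed again under another heading.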

    【Comments】:
