Scraping HTML tables without an obvious pagination class using R and rvest: answers

【Question title】: Scraping HTML tables without an obvious pagination class using R and rvest
【Posted】: 2018-10-19 17:28:31
【Question】:

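I am trying to scrape data from a site (the-numbers.com) whose data is spread across multiple web pages. The consecutive pages follow this URL format (only the first three are shown):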

url0 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
url1 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/101"
url2 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/201"
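The three URLs above suggest a pattern: the first page has no suffix, and each later page appends the rank of its first row (101, 201, ...). A minimal sketch of generating such URLs follows. It assumes the step of 100 continues beyond the three pages shown, which the post does not confirm, and `page_urls` is a hypothetical helper name.

```r
# Build the URLs of the first n_pages pages.
# Assumption: every page holds 100 rows, so page k (k >= 2) ends in
# "/<(k - 1) * 100 + 1>". Only the first three URLs are confirmed above.
page_urls <- function(base_url, n_pages) {
  stopifnot(n_pages >= 1)
  k <- seq_len(n_pages)
  suffix <- ifelse(k == 1, "", paste0("/", (k - 1) * 100 + 1))
  paste0(base_url, suffix)
}

base_url <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"

# Reproduces url0, url1 and url2 from the question
urls <- page_urls(base_url, n_pages = 3)
```

This only works when the number of pages is known in advance; when it is not, the pagination links on each page have to be followed instead.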

To scrape the first of these URLs (url0) into a data frame, the following code returns the correct output.

library(rvest)

webpage <- read_html("https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time")

tbls <- html_nodes(webpage, "table")

head(tbls)

tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE)

df <- tbls_ls[[1]]

The output looks like this:

> head(df)
  Rank Released                                Movie DomesticBox Office
1    1     2015 Star Wars Ep. VII: The Force Awakens       $936,662,225
2    2     2009                               Avatar       $760,507,625
3    3     2018                        Black Panther       $700,059,566

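How can I automatically scrape the subsequent URLs until the data runs out, so that the output is one long data frame with all pages row-bound (rbind()) together?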

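【Comments】:

At the bottom of each page there is a <div> with class="pagination" that holds links to the next n pages. Start from the first page, scrape the table and the pagination info, and iterate until there are no more links. There are plenty of examples of this on SO (some of them recent).
As an aside, although the-numbers.com/robots.txt puts no technical controls on this path, anyone with a bit of ethics should read the-numbers.com/research-analysis and at least donate (even $1.00) if they are going to use the data.
@hrbrmstr Glad you pointed me to that link; I will contribute.
FWIW, I actually practice what I blather on about. I wrote the OMDB API package (github.com/hrbrmstr/omdbapi) and can show Patreon receipts for $1.00 a month, even though I personally never use the API when I am not teaching advanced R classes (and I have not taught that class in the past 3 semesters).

Tags: r web-scraping rvest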

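【Solution 1】:

This question was asked a few months short of 3 years ago, but here is a solution anyway.

First, it is always a good idea to check whether scraping a site is allowed. In R, we can use the robotstxt package: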

robotstxt::paths_allowed("https://www.the-numbers.com")
 www.the-numbers.com                      

[1] TRUE

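OK, we are good to go. I would also like to echo what @hrbrmstr said about donating (even the smallest amount) as a way of supporting the work of the people behind this site (or any similar site).

The scraping function defined below uses R's repeat/if construct (similar to a do-while loop in other programming languages). Because the number of pages to scrape is not known in advance, the function takes a page_count argument that defaults to Inf; left at the default, it scrapes every page the site lists. To scrape only 10 pages, set page_count = 10. Here is the function definition: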

# Load packages ----

pacman::p_load(
  rvest,
  glue,
  stringr,
  dplyr,
  cli
)

# Custom function ----

scrape_data <- function(url, page_count = Inf){
  
  i <- 1
  data_list <- list()
  
  repeat {
    
    html <- read_html(url) 
    
    data_list[[i]] <- html %>%
      html_element(css = "table") %>%
      html_table()
    
    current_page <- html %>%
      html_element(css = "div.pagination > a.active") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,")
    
    all_displayed_pages <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,") %>%
      str_extract(pattern = "\\d+\\-\\d+")
    
    all_pages_urls <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_attr(name = "href")
    
    url <- glue("https://www.the-numbers.com{all_pages_urls[which(current_page == all_displayed_pages)+1]}")
    cli_alert_success(glue("Scraped page: {i}"))
    
    i <- i + 1
    
    # Stop on the last listed page, or once page_count pages have been scraped
    if(
      current_page == all_displayed_pages[length(all_displayed_pages)] ||
      i - 1 == page_count
    ){
      break
    }
  }
  
  bind_rows(data_list)
  
}
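
To see how the function decides which page comes next, the same selector logic can be run against a small in-memory snippet. The markup below is hypothetical: it is written only to match what the function's CSS selectors assume about the pagination <div>, and the /records/... paths are made up for illustration.

```r
library(rvest)
library(stringr)

# Hypothetical pagination markup (an assumption, not copied from the site)
html <- minimal_html('
  <div class="pagination">
    <a class="active" href="/records/all-time">1-100</a>
    <a href="/records/all-time/101">101-200</a>
    <a href="/records/all-time/201">201-300</a>
  </div>')

links <- html_elements(html, css = "div.pagination > a")

# Label of the page currently displayed, e.g. "1-100"
current_page <- html %>%
  html_element(css = "div.pagination > a.active") %>%
  html_text() %>%
  str_remove_all(pattern = ",")

# Labels and relative URLs of every page listed in the pagination bar
all_displayed_pages <- links %>%
  html_text() %>%
  str_remove_all(pattern = ",") %>%
  str_extract(pattern = "\\d+-\\d+")
all_pages_urls <- html_attr(links, name = "href")

# The next page is the link right after the active one
next_path <- all_pages_urls[which(current_page == all_displayed_pages) + 1]

# TRUE only when the active link is the last one listed
is_last_page <- current_page == all_displayed_pages[length(all_displayed_pages)]
```

With this snippet, next_path is "/records/all-time/101" and is_last_page is FALSE. On the last page the index runs past the end of all_pages_urls and next_path becomes NA, which is why the function tests for the last page before looping again.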

Now let's use the function to scrape the first 5 pages of the table:

scrape_data(
  url = "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
  page_count = 5
)

√ Scraped page: 1
√ Scraped page: 2
√ Scraped page: 3
√ Scraped page: 4
√ Scraped page: 5
# A tibble: 500 x 7
    Rank  Year Movie                                Distributor `DomesticBox Of~ `InternationalB~ `WorldwideBox O~
   <int> <int> <chr>                                <chr>       <chr>            <chr>            <chr>           
 1     1  2015 Star Wars Ep. VII: The Force Awakens Walt Disney $936,662,225     $1,127,953,592   $2,064,615,817  
 2     2  2019 Avengers: Endgame                    Walt Disney $858,373,000     $1,939,427,564   $2,797,800,564  
 3     3  2009 Avatar                               20th Cent…  $760,507,625     $2,085,391,916   $2,845,899,541  
 4     4  2018 Black Panther                        Walt Disney $700,059,566     $636,434,755     $1,336,494,321  
 5     5  2018 Avengers: Infinity War               Walt Disney $678,815,482     $1,365,725,041   $2,044,540,523  
 6     6  1997 Titanic                              Paramount…  $659,363,944     $1,548,622,601   $2,207,986,545  
 7     7  2015 Jurassic World                       Universal   $652,306,625     $1,017,673,342   $1,669,979,967  
 8     8  2012 The Avengers                         Walt Disney $623,357,910     $891,742,301     $1,515,100,211  
 9     9  2017 Star Wars Ep. VIII: The Last Jedi    Walt Disney $620,181,382     $711,453,759     $1,331,635,141  
10    10  2018 Incredibles 2                        Walt Disney $608,581,744     $634,223,615     $1,242,805,359  
# ... with 490 more rows

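One possible improvement to the function is to add some idle time (3 seconds) with Sys.sleep(3), in case the server kicks you off the site for hitting it too many times too quickly.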
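
One way to sketch that improvement without touching the rest of the loop is to wrap the request function so that it pauses before every call. The helper name `with_delay` is hypothetical, and the 3-second default simply mirrors the value suggested above rather than any limit published by the site.

```r
# Return a version of f that sleeps for `seconds` before each call.
# `with_delay` is an illustrative helper, not part of rvest.
with_delay <- function(f, seconds = 3) {
  force(f)
  force(seconds)
  function(...) {
    Sys.sleep(seconds)
    f(...)
  }
}

# A throttled drop-in replacement for read_html()
slow_read_html <- with_delay(rvest::read_html, seconds = 3)
```

Inside scrape_data(), replacing read_html(url) with slow_read_html(url) would then pause before every request.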

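【Discussion】:

Just curious how you stumbled on this question: are you working on a project with this data? Thanks for the great answer.
@JeremyK. You're welcome. Glad I could help. No, I'm not. I am currently learning how to scrape things that are less obvious, so I look around the internet for examples to try. That is how I found your question.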