【Question Title】: Is there a way to scrape through multiple pages on a website in R
【Posted】: 2021-04-06 13:04:30
【Question Description】:

I am new to R and web scraping. For practice, I am trying to scrape book titles from a fake website with multiple pages ('http://books.toscrape.com/catalogue/page-1.html') and then compute some metrics based on the titles. There are 20 books per page and 50 pages in total. I have managed to scrape the titles and compute the metrics for the first 20 books, but I want to compute the metrics for all 1000 books on the site.

The current output looks like this:

 [1] "A Light in the Attic"                                                                          
 [2] "Tipping the Velvet"                                                                            
 [3] "Soumission"                                                                                    
 [4] "Sharp Objects"                                                                                 
 [5] "Sapiens: A Brief History of Humankind"                                                         
 [6] "The Requiem Red"                                                                               
 [7] "The Dirty Little Secrets of Getting Your Dream Job"                                            
 [8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"       
 [9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"                                                                               
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"                                                
[12] "Shakespeare's Sonnets"                                                                         
[13] "Set Me Free"                                                                                   
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"                                       
[15] "Rip it Up and Start Again"                                                                     
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"            
[17] "Olio"                                                                                          
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"                                         
[19] "Libertarianism for Beginners"                                                                  
[20] "It's Only the Himalayas"

I would like the result to have a length of 1000 books instead of 20, which would let me use the same code to compute the metrics, just over 1000 books rather than 20.

Code:

library(rvest)  # provides read_html(), html_nodes(), html_attr() and the %>% pipe

url <- 'http://books.toscrape.com/catalogue/page-1.html'

url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles

What is the best way to scrape every book from the website so that the list has a length of 1000 instead of 20? Thanks in advance.

【Question Comments】:

    Tags: r web-scraping


    【Solution 1】:

    Generate the 50 URLs, then iterate over them, e.g. with purrr::map:

    library(rvest)
    
    # One URL per catalogue page
    urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
    
    # Scrape each page; the result is a list of 50 character vectors
    titles <- purrr::map(
      urls, 
      . %>% 
        read_html() %>%
        html_nodes('h3 a') %>%
        html_attr('title')
    )
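purrr::map() returns a list with one character vector of titles per page. To get the single 1000-element vector the question asks for, the list can be flattened (a small follow-up sketch, assuming the `titles` list produced above):

```r
# Flatten the list of per-page title vectors into one character vector
all_titles <- unlist(titles)
length(all_titles)  # 20 titles x 50 pages = 1000 when every page scrapes cleanly
```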
    

    【Comments】:

      【Solution 2】:

      Something like this, perhaps?

      library(tidyverse)
      library(rvest)
      library(data.table)
      # Vector with URLs to scrape: all 50 pages, not just the first 20
      url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
      # Scrape each page into a one-column data.table of titles
      L <- lapply( url, function(x) {
        print( paste0( "scraping: ", x, " ... " ) )
        data.table(titles = read_html(x) %>%
                    html_nodes('h3 a') %>%
                    html_attr('title') )
      })
      # Bind list to single data.table
      data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
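The rbindlist() call above prints the combined table but does not keep it. To compute metrics on the titles, the result can be assigned and the title column extracted (a small follow-up sketch, assuming the list `L` built above):

```r
# Bind the per-page tables into one data.table and pull out the title column
books <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
all_titles <- books$titles  # one entry per book; 1000 entries with all 50 pages
```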
      

      【Comments】:
