【Question title】: Cannot identify html node for scraping in rvest
【Posted】: 2020-11-21 15:44:54
【Question】:

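I am trying to scrape links from a page for later analysis, but I only get about half of them, probably because of the filtering. I am trying to extract the links highlighted here:

My approach is below. It is not ideal, because I believe I may be losing some links in the filter() call.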

library(rvest)
library(tidyverse)

#initiate session
session <- html_session("https://www.backlisted.fm/episodes")

#collect links for all episodes from the index page:

session %>% 
  read_html() %>% 
  html_nodes(".underline-body-links a") %>% 
  html_attr("href") %>% 
  tibble(link_temp = .) %>% 
  filter(str_detect(link_temp, pattern = "episodes/")) %>%
  distinct()

#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
 
#result:

link_temp                                                                        
   <chr>                                                                            
 1 /episodes/116-mfk-fisher-how-to-cook-a-wolf                                      
 2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women               
 3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody                  
 4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona         
 5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
 6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man           
 7 /episodes/114-william-golding-the-inheritors                                     
 8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia                    
 9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me                  
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king         
# … with 43 more rows

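I have been reading through several pieces of documentation, but I cannot target just that one type of href. Any help is much appreciated. Thank you.

【Question comments】:

    Tags: css r screen-scraping rvest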


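    【Solution 1】:

    Try this, which reads the /index page instead of /episodes and selects every link inside a list-item paragraph: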

    library(rvest)
    library(tidyverse)
    
    session <- html_session("https://www.backlisted.fm/index")
    raw_html <- read_html(session)
    node <- raw_html %>% html_nodes(css = "li p a")
    link <- node %>% html_attr("href")
    title <- node %>% html_text()
    tibble(title, link)
    
    # A tibble: 117 x 2
    #    title                                          link                                                                     
    #    <chr>                                          <chr>                                                                    
    #  1 "A Month in the Country"                       https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country     
    #  2 " - J.L. Carr (with Lissa Evans)"              #                                                                        
    #  3 "Good Morning, Midnight - Jean Rhys"           https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight     
    #  4 "It Had to Be You - David Nobbs"               https://www.backlisted.fm/episodes/3-david-nobbs-1                       
    #  5 "The Blessing - Nancy Mitford"                 https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing          
    #  6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
    #  7 "Passing - Nella Larsen"                       https://www.backlisted.fm/episodes/6-nella-larsen-passing                
    #  8 "The Great Fire - Shirley Hazzard"             https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire      
    #  9 "Lolly Willowes - Sylvia Townsend Warner"      https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
    # 10 "The Information - Martin Amis"                https://www.backlisted.fm/episodes/9-martin-amis-the-information         
    # … with 107 more rows
    
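The two outputs above show three kinds of href: absolute episode URLs, relative `/episodes/...` paths, and a bare `#` placeholder. A minimal sketch of one way to handle all three is shown below. It assumes a CSS attribute selector (`a[href*='/episodes/']`) is acceptable in place of the `filter()` call, and it uses `xml2::url_absolute()` to resolve relative paths. The inline HTML is a hypothetical stand-in for the real page so the example runs offline, and the title "Episode 116" is a placeholder, not taken from the site.

```r
library(rvest)
library(xml2)

# Hypothetical markup that mimics the three kinds of href seen above.
# It stands in for read_html(session) so the sketch runs without network access.
page <- read_html('
<ul>
  <li><p><a href="https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country">A Month in the Country</a><a href="#"> - J.L. Carr (with Lissa Evans)</a></p></li>
  <li><p><a href="/episodes/116-mfk-fisher-how-to-cook-a-wolf">Episode 116</a></p></li>
  <li><p><a href="https://www.backlisted.fm/episodes/6-nella-larsen-passing">Passing - Nella Larsen</a></p></li>
</ul>')

# Select only anchors whose href contains "/episodes/";
# the "#" placeholder never matches, so no filter() step is needed.
node  <- html_nodes(page, css = "a[href*='/episodes/']")
title <- html_text(node)
link  <- html_attr(node, "href")

# Resolve relative hrefs against the site root; absolute URLs pass through unchanged.
link <- url_absolute(link, base = "https://www.backlisted.fm/")

# Build the result and drop any episode that is linked more than once.
episodes <- data.frame(title, link, stringsAsFactors = FALSE)
episodes <- episodes[!duplicated(episodes$link), ]
episodes
```

Against the live site, the same selector could be passed to `html_nodes()` on the session's HTML. Normalizing to absolute URLs before de-duplicating matters because the same episode can otherwise appear once as a relative path and once as a full URL.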

    【Comments】:

    • Thanks very much. I also appreciate the extra separation that includes the titles.