[Title]: Scrape Data through rvest
[Posted]: 2025-12-15 12:15:02
[Question]:

I would like to get the article names, by category, from https://www.inquirer.net/article-index?d=2020-6-13

I tried to read the article names with:

library('rvest')

year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")

pg <- read_html(url)

test <- pg %>%
  html_nodes("#index-wrap") %>%
  html_text()

This returns just one string containing all the article names, and it is very messy.

What I ultimately want is a data frame that looks like this:

       Date     Category      Article Name
 2020-06-13         News      ‘We can never let our guard down’ vs terrorism – Cayetano
 2020-06-13         News      PNP spox says mañanita remark did not intend to put Sinas in bad light
 2020-06-13         News      After stranded mom’s death, Pasay LGU helps over 400 stranded individuals
 2020-06-13        World      4 dead after tanker truck explodes on highway in China
 etc.
 etc.
 etc.
 etc.
 2020-06-13    Lifestyle     Book: Melania Trump delayed 2017 move to DC to get new prenup

Does anyone know what I might be missing? Very new to this, thanks!

[Comments]:

  • Hi. Have you tried read_html(url) instead of just url?
  • Yes, I tried that, and it only returns one long string as the result.

Tags: r rvest


[Solution 1]:

This is probably as close as you can get:

library(rvest)
#> Loading required package: xml2
library(tibble)

year  <- 2020
month <- 06
day   <- 13
url   <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)

div       <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
links     <- html_nodes(div, xpath = '//a[@rel = "bookmark"]') 
post_date <- html_nodes(div, xpath = '//span[@class = "index-postdate"]') %>% 
             html_text()

test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))

test
#> # A tibble: 261 x 3
#>    date     text                              link                              
#>    <chr>    <chr>                             <chr>                             
#>  1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#>  2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#>  3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#>  4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#>  5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#>  6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#>  7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#>  8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#>  9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows

Created on 2020-06-14 by the reprex package (v0.3.0)
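This gives date, title, and link, but not the Category column the question asked for. One way to get it, assuming the page really pairs each <h4> heading with the <ul> of links that follows it (as the xpaths above suggest), is to walk the headings and take the links from each heading's following-sibling list. Sketched here on a small inline HTML fragment so it runs without hitting the live site:

```r
library(rvest)
library(tibble)

# Stand-in for the live page; on the real site this would be read_html(url)
page <- read_html('
  <div id="index-wrap">
    <h4>News</h4>
    <ul>
      <li><a rel="bookmark" href="a1">Story one</a></li>
      <li><a rel="bookmark" href="a2">Story two</a></li>
    </ul>
    <h4>World</h4>
    <ul>
      <li><a rel="bookmark" href="b1">Story three</a></li>
    </ul>
  </div>')

headings <- html_nodes(page, xpath = '//*[@id="index-wrap"]/h4')

# For each heading, take only the <a> tags in the <ul> immediately after it
per_category <- lapply(headings, function(h) {
  links <- html_nodes(h, xpath = 'following-sibling::ul[1]//a[@rel="bookmark"]')
  tibble(Category       = html_text(h),
         `Article Name` = html_text(links),
         link           = html_attr(links, "href"))
})

articles <- do.call(rbind, per_category)
articles
```

On the live page you would swap the inline string for read_html(url) and keep the rest unchanged.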

[Discussion]:

  • Just curious, I'd like to know how you did this. I think I understand xpath = //*[@id ="index-wrap"], but how did you know that was the location? I'm trying to replicate it with SelectorGadget. Same question for the bookmark part.
  • Because basically all I want to do next is add one more column with the clean article text. I tried this: article_data% html_nodes("#art_body_wrap") %>% html_text()
  • I can also post the above as a new question.
  • I'm currently using this as a tutorial, which might help: flukeout.github.io
  • @nak5120 That would probably be better as a new question, otherwise we may end up ping-ponging comments for a long time.
[Solution 2]:

You forgot read_html(), and then to use its result in the dplyr statement:

library('rvest')

year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")

#added page
page <- read_html(url)

test <- page %>%
  #changed xpath
  html_node(xpath = '//*[@id ="index-wrap"]') %>%
  html_text()

test

Update: I hate dplyr, but here is something before I head to bed:

library('rvest')

year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")

# added page
page <- read_html(url)

titles <- page %>%
  html_nodes(xpath = '//*[@id ="index-wrap"]/h4') %>%
  html_text()

sections <- page %>%
  html_nodes(xpath = '//*[@id ="index-wrap"]/ul')


stories <- sections %>%
  # ".//" keeps the search relative to each section instead of the whole document
  html_nodes(xpath = './/li/a') %>%
  html_text()

stories
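Building on the titles, sections, and stories objects above, the two vectors can be lined up into the data frame the question asked for by repeating each heading once per story in its section. This is a sketch that assumes each h4 title corresponds to the ul at the same position; if the lengths disagree, the stories xpath has probably also picked up links outside #index-wrap, and a relative './/li/a' keeps the two consistent:

```r
# Number of stories in each section, so each category title
# repeats the right number of times
n_per_section <- sapply(sections, function(s) {
  length(html_nodes(s, xpath = './/li/a'))
})

article_df <- data.frame(
  Date           = sprintf('%d-%02d-%02d', year, month, day),
  Category       = rep(titles, times = n_per_section),
  `Article Name` = stories,
  check.names    = FALSE
)

head(article_df)
```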

[Discussion]:

  • Thanks, I think I can make this work. It looks like everything is in one string now; I just need to parse the category out of strings like \r\n\t\t\t\t\tNEWS\r\n\t\t\t\t\t\t. Is there a better way to make this cleaner?
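Regarding the \r\n\t runs in the comment above: a simple way to clean such strings is to collapse every run of whitespace to a single space and trim the ends, e.g. with base R's gsub() and trimws() (stringr::str_squish() does both in one call). Shown on a literal example string:

```r
raw <- "\r\n\t\t\t\t\tNEWS\r\n\t\t\t\t\t\t"

# Collapse any run of whitespace to one space, then trim the ends
clean <- trimws(gsub("\\s+", " ", raw))
clean
#> [1] "NEWS"
```

The same call works vectorised over a whole scraped character vector, e.g. trimws(gsub("\\s+", " ", titles)).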