I cannot scrape the URLs of a news website

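【Question title】: I cannot scrape the URLs of a news website
【Posted】: 2019-12-02 21:42:38
【Question】:

I am trying to collect the URLs of a website with rvest, but the node/tag I am using (".node-title") does not seem to contain the "href" of each link. However, if I use the same node/tag to collect the URLs from the home page, it does work (I am trying to scrape the search section).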

#Getting the dynamic URL using %d

url_espectador <- 'https://www.elespectador.com/search/proceso paz farc?page=%d'

#The original website is https://www.elespectador.com/search/proceso%20de%20paz?page=1

#Reading through the pages and collecting website elements
map_df(1:10, function(i) {
  pagina <- read_html(sprintf(url_espectador, i, '%s', '%s', '%s', '%s'))

  data.frame(link = str_trim(html_attr(html_nodes(pagina, ".node-title"), "href")),
                      stringsAsFactors=FALSE)
  }) -> titulos_espectador

What I get is NA for every string. Can anyone help? Thanks!

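【Question comments】:

Tags: r web-scraping rvest

【Solution 1】:

The node-title class belongs to the parent element, which has no href attribute of its own. You want its child a tag, so the CSS selector is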

.node-title a
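
To see why the parent selector yields NA, here is a minimal sketch run against a small HTML fragment. The markup and the href value are assumptions made up for illustration; the real search page may be structured differently.

```r
library(rvest)

# Hypothetical markup imitating one search result (not copied from the site).
pagina <- read_html('<div class="node-title"><a href="/noticias/paz/ejemplo">Titular</a></div>')

# The class sits on the <div>, which carries no href, so html_attr() gives NA.
parent_href <- html_attr(html_nodes(pagina, ".node-title"), "href")

# Descending to the child <a> reaches the element that holds the link.
child_href <- html_attr(html_nodes(pagina, ".node-title a"), "href")

print(parent_href)
print(child_href)
```

html_attr() returns NA for any matched node that lacks the requested attribute, which is consistent with the column of NA values described in the question.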

Note that this selector returns relative links, so you will probably want to prefix them with the site's base URL.
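
As an alternative to pasting the prefix on by hand, xml2::url_absolute() can resolve links against a base URL. This is only a sketch: the two href values below are invented for illustration, and the approach mainly helps if a page mixes relative and absolute links, which is an assumption rather than something observed on this site.

```r
library(xml2)

# Base URL of the site being scraped.
base_url <- "https://www.elespectador.com/"

# Hypothetical hrefs: one root-relative, one already absolute.
hrefs <- c("/noticias/paz/ejemplo",
           "https://www.elespectador.com/otra/nota")

# Relative entries are resolved against base_url; absolute ones pass through.
links <- url_absolute(hrefs, base_url)
print(links)
```

A plain paste0() prefix would turn an already absolute link into a doubled URL, whereas url_absolute() leaves such entries unchanged.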

library(rvest)
library(stringr)
library(magrittr)
library(purrr)

# Search URL template; %d is replaced by the page number
url_espectador <- 'https://www.elespectador.com/search/proceso paz farc?page=%d'

#The original website is https://www.elespectador.com/search/proceso%20de%20paz?page=1

# Read each results page and collect the link of every headline
titulos_espectador <- map_df(1:2, function(i) {
  # The template has a single %d, so only the page number is passed
  pagina <- read_html(sprintf(url_espectador, i))

  # .node-title is the parent element; the href lives on its child <a>
  hrefs <- html_attr(html_nodes(pagina, ".node-title a"), "href")

  # The hrefs are relative, so prepend the site root
  data.frame(link = paste0("https://www.elespectador.com", str_trim(hrefs)),
             stringsAsFactors = FALSE)
})

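【Comments】:

Is this what you wanted?
Thanks, @QHarr, it works well. However, I am now having trouble collecting the content of each link. I am using content_espectador = lapply(titulos_espectador[ , 1], function(x) {read_html(x) %>% html_nodes(".node-body") %>% html_text %>% as.character}) to collect the article content, but the result also contains a lot of other information (window.setContentCreated = \"2014-03-28T09:37:19-0500\";\n window.setContentAuthor = \"EFE\";\n window.setContentSection = \"Posconflicto\";\n). I don't know how to identify the tag that retrieves only the content.
Be specific about what content you want from each page and how you want it output.
I want the article content (the text) of each link. Each link is a news story, so I want to collect the text of that story and store it in a data frame. With the "node-body" tag I do get the news content of each link, but it comes with a lot of other data, numbers, and information (not just the text of the story).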