Extracting multiple pieces of text from multiple web pages: answers

【Question title】: Extracting multiple pieces of text from multiple web pages
【Posted】: 2017-02-06 18:38:16
【Question description】:

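The first part of this code (up to "pages") successfully retrieves the pages I want to scrape. I am then struggling to find a way to extract snippets of the article text, together with the associated date, as a data frame.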

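I get:

Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"

Any guidance on elegance, clarity and efficiency is also welcome, as this is personal learning.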

library(rvest)
library(tidyverse)
library(plyr)
library(stringr)

llply(1:2, function(i) {

  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>% 
    html_nodes(".Headline--regular a") %>% 
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")

}) -> links

pages <- links %>% unlist() %>% map(read_html)

map_df(pages, function(x) {

  text = read_html(x) %>% 
    html_nodes(".Article-content p") %>% 
    html_text() %>% 
    str_extract(".+skills.+")

  date = read_html(x) %>% 
    html_nodes(".Dateline") %>% 
    html_text()

}) -> article_df

【Question comments】:

Tags: r web-scraping purrr rvest

【Solution 1】:

Nicely done, you are almost there! There are two errors:

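The variable pages already contains parsed HTML documents. Applying read_html again to each page (that is, inside map_df) therefore does not work. This is the error message you are getting.
The function inside map_df is not correct. Because there is no explicit return, the last evaluated value is returned, which is date. The variable text is dropped entirely. You have to pack both variables into one data frame.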
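Both points can be reproduced offline. The following is a minimal sketch, not the code from the question: the HTML string and the helper names first_try and fixed are made up for illustration, and toupper/nchar merely stand in for the two scraping steps.

```r
library(rvest)  # re-exports read_html() from xml2

# 1. Calling read_html() on an already parsed document raises an error,
#    because read_xml() has no method for xml_document objects.
doc <- read_html("<html><body><p>Hello</p></body></html>")
msg <- tryCatch(
  {
    read_html(doc)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# 2. A function returns the value of its last expression only.
#    Hypothetical stand-in for the function passed to map_df in the question.
first_try <- function(x) {
  text = toupper(x)  # computed, then discarded
  date = nchar(x)    # last expression, so this is the return value
}

# Returning both values means wrapping them in one object.
fixed <- function(x) {
  data.frame(text = toupper(x), date = nchar(x), stringsAsFactors = FALSE)
}

only_date <- first_try("skills")  # just the value assigned to date
both <- fixed("skills")           # one row with columns text and date
```

Here msg holds the text of the caught error rather than a parsed document, and only_date contains nothing of the text value.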

The corrected code is below.

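# Each element of pages is already a parsed document, so no read_html() here.
# Note: data_frame() is deprecated in current tibble releases; tibble() is the
# drop-in replacement.
article_df <- map_df(pages, function(x) {
  data_frame(
    # One value per paragraph; NA where the paragraph does not mention "skills"
    text = x %>% 
      html_nodes(".Article-content p") %>% 
      html_text() %>% 
      str_extract(".+skills.+"),

    # Dateline of the article, recycled across all of its paragraphs
    date = x %>% 
      html_nodes(".Dateline") %>% 
      html_text()
  )
})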
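To try the corrected function without hitting the site, the same pattern can be run against literal HTML. This is only a sketch under assumptions: the two page strings are invented stand-ins that reuse the class names from the question, tibble() is used in place of data_frame(), and the final NA filter is an optional extra step that is not part of the answer.

```r
library(rvest)
library(purrr)
library(stringr)
library(tibble)

# Two made-up article pages, parsed once from literal HTML strings.
pages <- map(
  c(
    paste0(
      '<div class="Dateline">February 6 2017</div>',
      '<div class="Article-content">',
      '<p>Digital skills are in demand.</p>',
      '<p>An unrelated paragraph.</p>',
      '</div>'
    ),
    paste0(
      '<div class="Dateline">February 7 2017</div>',
      '<div class="Article-content"><p>Nothing relevant here.</p></div>'
    )
  ),
  read_html
)

# Same shape as the corrected function: one row per paragraph,
# with the single dateline recycled across the rows of that page.
article_df <- map_df(pages, function(x) {
  tibble(
    text = x %>%
      html_nodes(".Article-content p") %>%
      html_text() %>%
      str_extract(".+skills.+"),
    date = x %>%
      html_nodes(".Dateline") %>%
      html_text()
  )
})

# Optional: keep only the paragraphs that matched the pattern.
matches <- article_df[!is.na(article_df$text), ]
```

A page with several paragraphs but no .Dateline element would make the two columns incompatible in length, so real pages may need a check for that case.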

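A few more comments on the code itself:

I think it is better to use <- rather than ->. That makes it easier to see where a variable is assigned, and if you use descriptive ("speaking") variable names, the code is easier to understand.
I prefer the package purrr over plyr. purrr is part of the tidyverse, so you can simply use map instead of llply. There is a nice article on purrr and plyr.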

links <- map(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>% 
    html_nodes(".Headline--regular a") %>% 
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
})
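As a side note on map: purrr also has typed variants such as map_chr, which return an atomic vector instead of a list. The sketch below shows the difference on the URL-building step only, so it needs no network access; search_url is a hypothetical helper, not something from the question.

```r
library(purrr)
library(stringr)

# Hypothetical helper: builds the search URL for result page i.
search_url <- function(i) {
  str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")
}

urls_list <- map(1:2, search_url)      # a list of length 2, as llply() would give
urls_chr  <- map_chr(1:2, search_url)  # a character vector of length 2
```

For the links themselves, each search page yields several hrefs, so map followed by unlist (as in the question) remains the natural choice there.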

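【Comments】:

Thank you very much. The tips and the fix are much appreciated. In the context of this example, are there any advantages or disadvantages of a tibble compared with a data.frame?
I would say yes. If you simply replace the function data_frame() with data.frame() (that is, create a data.frame instead of a tibble), you get the warning message In bind_rows_(x, .id) : Unequal factor levels: coercing to character. You should use the option stringsAsFactors = FALSE to avoid these (a lot has been written about the problems caused by the default stringsAsFactors = TRUE). Apart from that, a tibble is nicer if you have more links: by default the whole tibble is not printed, only the first ten rows.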
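The column-type difference mentioned in this comment can be seen in a small sketch. The data is made up, and stringsAsFactors is passed explicitly in both directions because its default changed to FALSE in R 4.0.0, so the default behaviour depends on the R version in use.

```r
library(tibble)

words <- c("skills", "tech")

# data.frame() with the pre-4.0.0 default behaviour made explicit
df_factor <- data.frame(text = words, stringsAsFactors = TRUE)

# data.frame() keeping strings as character
df_char <- data.frame(text = words, stringsAsFactors = FALSE)

# tibble() never converts strings to factors
tb <- tibble(text = words)

class(df_factor$text)
class(df_char$text)
class(tb$text)
```

Binding many per-page data frames whose factor columns have different levels is what triggers the warning quoted above; character columns avoid it.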