Preparing multiple URLs for webscraping with rvest in R

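【Question title】: Preparing multiple URLs for webscraping with rvest in R
【Posted】: 2020-02-29 14:47:51
【Question】:

I get inconsistent results when scraping multiple URLs with rvest. The concatenated URL strings give a character vector, but running html_nodes on it produces one of three different outcomes.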

library(rvest)
library(purrr)  # provides map()

url <- c("https://interestingengineering.com/due-to-the-space-inside-atoms-you-are-mostly- 
          made-up-of-empty-space",
         "https://futurism.com/mit-tech-self-driving-cars-see-under-surface-road",
         "https://techxplore.com/news/2020-02-socially-robot-children-autism.html",
         "https://eos.org/science-updates/hackathon-speeds-progress-toward-climate-model- 
          collaboration",
         "https://www.smithsonianmag.com/innovation/new-study-finds-people-prefer-robots- 
          explain-themselves-180974299/",
         "https://www.sciencedaily.com/releases/2020/02/200227144259.htm")

page <- map(url, ~ read_html(.x) %>% html_nodes("p") %>% html_text())

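Sometimes this code returns the content extracted from all of the URLs.

Other times it gives this error message:

Error in open.connection(x, "rb") : Error while processing content unencoding: invalid code lengths set

Or this error message:

Error during wrapup: HTTP error 410.

After that last error message I also get Browse[1]> in the console.

I also tried running the URLs from a CSV file: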

urldoc <- read.csv("URLs for rvest.csv", stringsAsFactors = FALSE, sep = ",")
page <- map(urldoc, ~ read_html(.x) %>% html_nodes("p") %>% html_text())

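The print(urldoc) output looks similar to the output from the concatenated code, but I get a different error message:

Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=83].

I can't run html_node or html_text on a data frame.

1) How do I get consistent, error-free results?
2) Better yet, how can I use a file containing the URLs instead of a concatenated string?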

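【Comments】:

Your current "url" vector contains line breaks inside some of the URLs themselves, which will cause errors. Once I corrected that, I could not reproduce your errors above. An HTTP 410 error means the page is not valid, so double-check that all the URLs are correct. As for your last question, make sure each URL is on a single line and that your CSV file has only one URL per row.
Thank you! Unfortunately, I get the same error even after cleaning up the CSV. It won't accept a vector and expects a single string value instead. It works when I run the URLs as a concatenated vector, but then I hit the HTTP problem. I'm guessing it stops scraping when it encounters a non-working URL, so I need to find a way to ignore those URLs.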
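
One possible way to ignore non-working URLs, sketched here rather than taken from the thread, is to wrap the reading step in purrr's possibly(), which returns a fallback value instead of stopping on an error. The names read_paragraphs, safe_read_paragraphs and scrape_all are hypothetical helpers introduced for illustration.

```r
library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(purrr)  # map(), possibly(), discard(), set_names()

# Extract the text of every <p> element from one page.
read_paragraphs <- function(u) {
  read_html(u) %>% html_nodes("p") %>% html_text()
}

# Same function, but any error (HTTP 410, a decoding failure, ...)
# yields NULL instead of aborting the whole map() call.
safe_read_paragraphs <- possibly(read_paragraphs, otherwise = NULL)

# Hypothetical wrapper: try every URL, then drop the ones that failed.
scrape_all <- function(urls) {
  pages <- map(set_names(urls), safe_read_paragraphs)
  discard(pages, is.null)
}

# page <- scrape_all(url)  # `url` is the character vector from the question
```

The result is a list named by URL, so comparing names(page) with the input vector shows which URLs were skipped. Note that possibly() only catches errors; warnings still pass through.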

Tags: html r string csv rvest

【Solution 1】:

Your first problem appears to be caused by the line breaks in your URLs.
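
A minimal sketch of that fix, assuming the stray line breaks and indentation are the only thing wrong with the strings: a valid URL contains no literal whitespace, so removing every whitespace run rejoins the pieces. The helper name clean_url is hypothetical and not part of rvest.

```r
# Hypothetical helper: delete every run of whitespace (spaces, tabs, newlines).
# gsub() is vectorised, so this also works on a whole vector of URLs.
clean_url <- function(x) gsub("\\s+", "", x)

# First URL from the question, with the line break inside the string.
broken <- "https://interestingengineering.com/due-to-the-space-inside-atoms-you-are-mostly- 
          made-up-of-empty-space"

fixed <- clean_url(broken)
```

Applying clean_url() to the whole vector before calling map() would avoid having to retype the URLs by hand.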

As for your second problem: I was able to reproduce it from a .csv file. Here is the solution I found.

urldoc <- read.csv("URLs for rvest.csv", stringsAsFactors = FALSE, sep = ",", header = FALSE)
page <- map(urldoc[, 1], ~ read_html(.x) %>% html_nodes("p") %>% html_text())

Make sure your .csv has only one URL per row, and specify the column to read from.
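
The reason the column has to be selected is that a data frame is a list of columns, so map() passes each whole column to the function in one go. That appears consistent with the "extent=83" in the error message, which suggests a vector of 83 URLs reached read_html() at once. The toy data frame and example.com URLs below are made up for illustration.

```r
library(purrr)

# Toy stand-in for the CSV contents: one column, one URL per row.
urldoc <- data.frame(
  V1 = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  stringsAsFactors = FALSE
)

# Mapping over the data frame visits each COLUMN once:
# the function receives a vector of 3 strings.
per_column <- map_int(urldoc, length)

# Mapping over the extracted column visits each URL once:
# the function receives a single string every time.
per_url <- map_int(urldoc[, 1], length)
```

Here per_column has a single entry of 3, while per_url has three entries of 1, which is the shape read_html() expects.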

【Discussion】:

Thank you. I cleaned up the csv a bit and made sure to subset the column containing the URLs. Unfortunately, the error is still there.