How to scrape web data from multiple pages whose URLs differ based on the title? Answers

【Question title】: How to scrape web data of multiple pages which have different URLs based on the title?
【Posted】: 2019-04-28 00:36:53
【Question description】:

I am scraping web data from the URL http://iias.ac.in/recent-publications. I have already scraped all the titles on that page using "rvest". Now I have a vector containing the book titles:

title_book
[1] "Some Essays of Tagore : History. Society. Politics"
[2] "INVISIBLE WEBS: An art Historical inquiry into the life and death of Jangarh Singh Shyam" ..

Now I want to scrape the data for each book. Each book's URL is derived from its title, like this: http://iias.ac.in/publication/some-essays-tagore-history-society-politics

Since the vector title_book holds the suffixes that follow the common URL "http://iias.ac.in", how can I scrape the data from all such URLs at once?

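【Question comments】:

Tags: r web-scraping rvest

【Solution 1】:

It looks like some data cleaning steps are needed first. I strongly recommend the stringr package. Here is how I would do it.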

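library(stringr)  # provides the str_*() functions and re-exports %>%

title_book = c("Some Essays of Tagore : History. Society. Politics",
  "INVISIBLE WEBS: An art Historical inquiry into the life and death of Jangarh Singh Shyam")

# Turn each title into a URL suffix
title_book_edited = title_book %>% 
  str_to_lower() %>%                                       # lowercase everything
  str_replace_all(pattern = " ", replacement = "-") %>%    # spaces become hyphens
  str_remove_all(pattern = ":") %>%                        # drop colons
  str_remove_all(pattern = "\\.")                          # drop literal dots

# Prepend the common part of the URL
title_book_list = paste0("http://iias.ac.in/publication/", title_book_edited)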

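I used str_to_lower() to convert the strings to lowercase, str_replace_all() to replace every match of a pattern, and str_remove_all() to delete every match of a pattern. The output looks like this.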

> title_book_list
[1] "http://iias.ac.in/publication/some-essays-of-tagore--history-society-politics"                                        
[2] "http://iias.ac.in/publication/invisible-webs-an-art-historical-inquiry-into-the-life-and-death-of-jangarh-singh-shyam"
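Note that the first URL in the output above contains a double hyphen ("tagore--history"), because the " : " in the title turns into two hyphens once the colon is removed, while the sample URL in the question uses single hyphens. A minimal sketch of one way to avoid this, using base R only: replace every run of non-alphanumeric characters with a single hyphen. The helper name make_slug is hypothetical, and the exact rule the site uses to build its URLs is an assumption, not something the question confirms.

```r
# Hypothetical helper: build a URL slug from a title.
# Assumption: the site lowercases titles and joins words with single hyphens.
make_slug <- function(titles) {
  slug <- tolower(titles)
  # Collapse every run of characters other than a-z and 0-9 into one hyphen
  slug <- gsub("[^a-z0-9]+", "-", slug)
  # Drop any hyphen left at the start or end
  gsub("^-+|-+$", "", slug)
}

title_book <- c(
  "Some Essays of Tagore : History. Society. Politics",
  "INVISIBLE WEBS: An art Historical inquiry into the life and death of Jangarh Singh Shyam"
)

title_book_list <- paste0("http://iias.ac.in/publication/", make_slug(title_book))
print(title_book_list)
```

With this sketch the first title becomes "some-essays-of-tagore-history-society-politics". That still differs from the URL quoted in the question, which omits "of", so the site may drop some short words as well. It is worth checking a few generated URLs by hand, or scraping the real links from the listing page instead of rebuilding them from titles.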

See the official documentation for more information. I hope this helps.
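To scrape all of the URLs in one go, which is what the question asks, one option is to apply a small function to every element of the URL vector. The following is only a sketch: scrape_title, scrape_all and the delay argument are hypothetical names, html_element() assumes rvest 1.0.0 or later, and the "title" selector is a placeholder because the question does not say which fields of each book page are wanted.

```r
library(rvest)

# Hypothetical helper: return the text of a page's <title> element,
# or NA if the page cannot be read.
scrape_title <- function(url) {
  tryCatch({
    page <- read_html(url)
    # Placeholder selector; swap in the CSS selector for the data you need
    html_text(html_element(page, "title"), trim = TRUE)
  }, error = function(e) NA_character_)
}

# Hypothetical helper: scrape every URL and collect the results in a data frame.
scrape_all <- function(urls, delay = 1) {
  results <- vapply(urls, function(u) {
    Sys.sleep(delay)  # pause between requests to be polite to the server
    scrape_title(u)
  }, character(1))
  data.frame(url = urls, page_title = unname(results), stringsAsFactors = FALSE)
}

# Usage (needs network access):
# book_pages <- scrape_all(title_book_list)
```

Wrapping the request in tryCatch() means a single wrong or missing URL gives an NA row instead of stopping the whole run, which is useful here because URLs rebuilt from titles may not all exist.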

【Comments】: