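【Question title】: Scraping PDFs of all linked websites
【Posted】: 2020-08-27 10:09:12
【Question】:

I would like to scrape official laws from a website (here is an example). The documents can be reached through a menu on the HTML site. I have managed to extract links and download PDFs from sites such as GitHub, but I am struggling to extract the links from this kind of site. I tried the following code: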

library(rvest)

# read html 
page <- read_html("https://bl.clex.ch/app/de/texts_of_law/780")

# from nodes I would like to get the links where the PDFs are stored
raw_list <- page %>%   # takes the page above for which we've read the html
  html_nodes("a") %>%  # find all links in the page
  html_attr("href")

No links are found on this site; the result is an empty character vector:

character(0)

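My questions:

What is different about the menu on the linked site, compared with PDFs stored on GitHub that are reached through links on a project's main page?
How can I access the links and download all the PDFs stored in this menu?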

【Comments】:

Tags: r web-scraping rvest

【Solution 1】:

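Apparently the site you are trying to scrape is Angular-based, i.e. it loads its content through XHR requests. You can see this by inspecting the XHR requests in the Network tab of Chrome's developer tools.

There you will find that the site calls https://bl.clex.ch/api/de/texts_of_law/780 (basically, app is replaced with api). This request returns a JSON string.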
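
To illustrate why the rvest call in the question comes back empty, here is a minimal offline sketch. The HTML string is a made-up stand-in for the kind of shell a JavaScript-rendered app typically serves (the `app-root` element and the `main.js` file name are assumptions, not copied from the site). It contains no `<a>` elements, so `html_nodes("a")` has nothing to find until a browser runs the script.

```r
library(rvest)

# Hypothetical static shell of a client-rendered page (not the site's real markup).
shell <- '<html><body><app-root></app-root><script src="main.js"></script></body></html>'

# Same pipeline as in the question, applied to the shell.
links <- read_html(shell) %>%
  html_nodes("a") %>%
  html_attr("href")

print(links)  # an empty character vector, as in the question
```

On GitHub, by contrast, the links are typically already present in the HTML the server sends, which is presumably why the same approach worked there.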

I tried to parse it with jsonlite, but that raised an error, so I used a regular expression to match every entry whose key contains pdf_link.

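library(RCurl)

# Page URL from the question; the JSON endpoint swaps "/app/" for "/api/"
uri <- "https://bl.clex.ch/app/de/texts_of_law/780"
json <- getURL(sub("/app/", "/api/", uri, fixed = TRUE))
# Capture each "pdf_link..." key and its value; columns 2:3 hold the two groups
stringr::str_match_all(json, '"(pdf_link[a-z_]*?)":"(.+?)",')[[1]][, 2:3]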

Output

     [,1]                        [,2]                                                                                          
[1,] "pdf_link"                  "http://bl.clex.ch/frontend/versions/pdf_file_with_annex/1337?locale=de"                      
[2,] "pdf_link_tol"              "http://bl.clex.ch/frontend/versions/1337/download_pdf_file?locale=de"                        
[3,] "pdf_link_annexes"          "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_annex?locale=de"           
[4,] "pdf_link_tol_with_annexes" "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_file_and_annex?locale=de"  
[5,] "pdf_link"                  "http://bl.clex.ch/frontend/change_document_file_dictionaries/194/download_pdf_file?locale=de"
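
The answer stops at listing the links, while the question also asks how to download the files. A rough sketch of that remaining step could look like the following. `pdf_file_name` and `download_pdfs` are hypothetical helpers, and the file naming scheme (last three path segments joined with underscores) is just one assumption about how to keep names unique. The two URLs are copied from the output above; whether the server still serves them, or redirects them to https, has not been checked here.

```r
# Two links copied from the output above, used as sample input.
links <- c(
  "http://bl.clex.ch/frontend/versions/1337/download_pdf_file?locale=de",
  "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_annex?locale=de"
)

# Hypothetical helper: derive a local file name from a link by dropping the
# query string and joining the last three path segments with underscores.
pdf_file_name <- function(url) {
  path <- sub("\\?.*$", "", url)
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  paste0(paste(utils::tail(parts, 3), collapse = "_"), ".pdf")
}

# Hypothetical helper: download each distinct link into dest_dir.
download_pdfs <- function(urls, dest_dir = "pdfs") {
  dir.create(dest_dir, showWarnings = FALSE)
  for (url in unique(urls)) {
    dest <- file.path(dest_dir, pdf_file_name(url))
    # mode = "wb" writes the PDF as binary, which matters on Windows.
    tryCatch(
      download.file(url, dest, mode = "wb", quiet = TRUE),
      error = function(e) message("Failed to download: ", url)
    )
  }
  invisible(NULL)
}

# download_pdfs(links)  # uncomment to fetch the files (needs network access)
```

To cover every law in the menu, the same extraction and download steps would have to be repeated for each document ID; how to enumerate those IDs depends on the site's API and is not shown in the answer.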

【Discussion】: