Extracting JSON links from a webpage in R: answers

【Question title】: Extracting JSON links from a webpage in R
【Posted】: 2021-06-02 11:09:01
【Question】:

I have some URLs, for example:

https://www.ine.es/jaxiT3/Tabla.htm?t=30656&L=0
https://www.ine.es/jaxiT3/Tabla.htm?t=30813&L=0

and so on.

Each page has a download icon in its top-right corner. Clicking it offers the option to download the table in JSON format.

The JSON link looks like this:

https://servicios.ine.es/wstempus/js/es/DATOS_TABLA/30656?tip=AM&

I can read one of these JSON URLs with:

library(jsonlite)
out <- fromJSON("https://servicios.ine.es/wstempus/js/es/DATOS_TABLA/30656?tip=AM&")

My question is: how can I extract the JSON URL behind the download link of every page?

I located the button element with:

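library(rvest)  # provides read_html(), html_nodes() and the %>% pipe

url <- "https://www.ine.es/jaxiT3/Tabla.htm?t=30656&L=0"
# Select every <a> element and keep the 15th, which is the download button on this page
read_html(url) %>% 
  html_nodes("a") %>% 
  .[15]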

However, I am not sure whether this positional index works for all of the URLs. Data:

Data <- structure(list(index = c("2.1.1", "2.1.2", "2.1.3", "2.1.4", 
"2.1.5", "2.1.6"), title = c("Indicadores de renta media y mediana", 
"Distribución por fuente de ingresos", "Porcentaje de población con ingresos por unidad de consumo por debajo de determinados umbrales fijos por sexo", 
"Porcentaje de población con ingresos por unidad de consumo por debajo de determinados umbrales fijos por sexo y tramos de edad", 
"Porcentaje de población con ingresos por unidad de consumo por debajo de determinados umbrales fijos por sexo y nacionalidad", 
"Porcentaje de población con ingresos por unidad de consumo por debajo/encima de determinados umbrales relativos por sexo"
), link = c("https://www.ine.es/jaxiT3/Tabla.htm?t=30656&L=0", 
"https://www.ine.es/jaxiT3/Tabla.htm?t=30813&L=0", "https://www.ine.es/jaxiT3/Tabla.htm?t=30657&L=0", 
"https://www.ine.es/jaxiT3/Tabla.htm?t=30659&L=0", "https://www.ine.es/jaxiT3/Tabla.htm?t=30660&L=0", 
"https://www.ine.es/jaxiT3/Tabla.htm?t=30661&L=0"), provincia = c("Albacete", 
"Albacete", "Albacete", "Albacete", "Albacete", "Albacete")), row.names = c(NA, 
6L), class = "data.frame")

【Comments】:

Tags: r web-scraping rvest

【Solution 1】:

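By the look of it, the links share a common structure. Simply extract the t parameter from each link in the link column and insert it into the JSON URL template string: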

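library(stringr)
# Pull the numeric t parameter out of each link and insert it into the JSON URL template.
# sprintf() and str_match() are vectorized, so json_link becomes a plain character column.
Data$json_link <- sprintf('https://servicios.ine.es/wstempus/js/es/DATOS_TABLA/%s?tip=AM&', str_match(Data$link, 't=(\\d+)')[, 2])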
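
The same idea can be sketched in base R, with a guard for links that carry no t parameter. This is only an illustration: the helper name ine_json_url is hypothetical, the URL template is the one quoted in the question, and it is an assumption that every INE table page follows this pattern.

```r
# Hypothetical helper (not part of the original answer): build the JSON
# endpoint for an INE table page URL using base R only.
ine_json_url <- function(page_url) {
  # Capture the digits that follow "t=" when it is a query parameter
  # (preceded by "?" or "&"), so a parameter such as "ft=" is not matched.
  pattern <- "^.*[?&]t=([0-9]+).*$"
  table_id <- ifelse(grepl(pattern, page_url),
                     sub(pattern, "\\1", page_url),
                     NA_character_)
  # Return NA where no table id was found instead of building a broken URL.
  ifelse(
    is.na(table_id),
    NA_character_,
    sprintf("https://servicios.ine.es/wstempus/js/es/DATOS_TABLA/%s?tip=AM&", table_id)
  )
}

links <- c(
  "https://www.ine.es/jaxiT3/Tabla.htm?t=30656&L=0",
  "https://www.ine.es/jaxiT3/Tabla.htm?t=30813&L=0",
  "https://www.ine.es/jaxiT3/Tabla.htm?L=0"  # no t parameter, yields NA
)
json_links <- ine_json_url(links)
print(json_links)
```

With the data frame from the question, Data$json_link <- ine_json_url(Data$link) would fill the column, and the non-missing entries could then be passed one by one to jsonlite::fromJSON, as the question already does for a single URL.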

【Comments】: