How to scrape a JSP page in R? Answers

【Question title】: How to scrape a JSP page in R?
【Posted】: 2018-05-30 05:02:14
【Question description】:

I want to scrape the content of the following page in R: http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.htm

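However, I can't find any HTML tags, or anything else, that would help me get at the information.

I want to build a data frame from the information in the "INHABILITADOS Y MULTADOS" section, shown in the image below: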

This is the particular option I'm trying to scrape

After selecting this option, a menu listing multiple providers appears. Each provider has its own table containing the information I want to collect.

The list of providers

The information I finally want to scrape

【Question comments】:

Use RSelenium for dynamically generated HTML content.

Tags: r jsp web-scraping rvest httr

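【Solution 1】:

Normally you can make the request with the GET method. For this website, however, you need to use POST:

Inspect the Network tab in Chrome developer tools (press F12).

In the following picture, the form data is submitted in the body of the POST request.

Look for the pattern in onclick: the onclick value is what is used to submit the form.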
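To make the onclick pattern concrete, here is a small offline sketch in base R. It applies the same regular expression as the script below to a single string. The `onclick` value and the function name `enviar` are made up for illustration; the real attribute on the site may be shaped differently.

```r
# Hypothetical onclick value (an assumption, not copied from the site)
onclick <- "enviar('000090121/2006',1,3)"

# Keep everything between the first "(" and the next ")"
inside <- regmatches(
  onclick,
  regexpr("(?<=\\().*?(?=\\))", onclick, perl = TRUE)
)

# Split the argument list on commas and trim stray whitespace
parts <- trimws(strsplit(inside, ",", fixed = TRUE)[[1]])

# Drop the quotes around the first argument
expe    <- substr(parts[1], 2, nchar(parts[1]) - 1)
tipo    <- parts[2]
persona <- parts[3]

print(c(expe = expe, tipo = tipo, persona = persona))
```

The three values correspond to the `expe`, `tipo` and `persona` form fields that the script posts to the detail page.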

The following script should work:

library(httr)
library(rvest)
library(stringr)
library(dplyr)
my_url <- "http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp"
my_ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"


# Use POST instead of GET to get the correct response
response <- POST(my_url,
                 user_agent(my_ua),
                 body = list(cmdsan = "INHABILITA",
                             tipoqry = "INHABILITA",
                             mostrar_msg = "SI"),
                 encode = "form")


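# Collect every link in the first table; each link is one provider
href_nodes <- content(response) %>%
  html_node("table") %>%
  html_nodes("a")

# Visible link text, i.e. the provider name
link_text <- href_nodes %>%
  html_text() %>%
  as_tibble() %>% # as.tibble() is deprecated
  rename(text = value)

# Arguments of the onclick call, which become the POST fields
form_items <- href_nodes %>%
  html_attr("onclick") %>%
  str_extract("(?<=\\().*?(?=\\))") %>% # everything inside the parentheses
  str_split(",", simplify = TRUE) %>% # one column per argument
  as.data.frame(stringsAsFactors = FALSE) %>% # names the columns V1, V2, V3
  as_tibble() %>%
  mutate(V1 = str_sub(V1, start = 2, end = -2)) # drop the quotes around V1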


submit_table <- bind_cols(link_text,form_items)

# Use POST again to reach the page you want.
# For example, to open the page for A Y M CONSTRUCTORA, S.A. DE C.V. (row 2):

row_num <- 2

my_url2 <- "http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp"

response1 <- POST(my_url2,
                 user_agent(my_ua),
                 body = list(expe = submit_table$V1[row_num],
                             tipo = submit_table$V2[row_num],
                             persona = submit_table$V3[row_num]),
                 encode = "form")

submit_table holds the values that are later used in the POST requests that fetch each individual page:

> submit_table
# A tibble: 1,329 x 4
   text                                        V1             V2    V3
   <chr>                                       <chr>          <chr> <chr>
 1 A AND P INTERNATIONAL                       185770002/2016 1     3
 2 A Y M CONSTRUCTORA, S.A. DE C.V.            000090121/2006 1     3
 3 A Y V INDUSTRIAL Y COMERCIAL, S.A. DE C.V.  184000001/2013 1     3
 4 A+D ARQUITECTOS, S.A. DE C.V.               181640187/2006 1     3
 5 A.D.C. Consultores y Servicios, S.A de C.V. 111510007/2005 1     3
 6 AARON VERA MORALES                          006410056/2011 1     3
 7 ABASTECEDORA DE FÁRMACOS, S.A. DE C.V.      006410002/2014 1     3
 8 ABASTECEDORA EZCO, S.A. DE C.V.             000070024/2016 1     3
 9 ABEL ZURITA MAYO                            000200012/2014 1     3
10 ABS TECNOLOGÍA, S.A. DE C.V.                090850001/2016 1     3
# ... with 1,319 more rows

You can then use rvest functions to extract the elements you need from the response:

(content(response1) %>% html_nodes(".normal") %>% html_text() %>% str_trim())[3]

This returns:

[1] "Publicación en el DOF: 05 DE ABRIL DE 2007Monto de la Multa: $ 72,540.00Plazo de inhabilitación: 3 MESESInicia: 06 DE ABRIL DE 2007Termina: 06 DE JULIO DE 2007"
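Since the goal is a data frame, the concatenated string above still has to be split into fields. The following is a minimal base R sketch. It assumes every record uses the same five labels seen in this one output, which has not been checked against other providers; `parse_ficha` is a hypothetical helper name.

```r
# The string returned above for row 2
raw <- "Publicación en el DOF: 05 DE ABRIL DE 2007Monto de la Multa: $ 72,540.00Plazo de inhabilitación: 3 MESESInicia: 06 DE ABRIL DE 2007Termina: 06 DE JULIO DE 2007"

# Labels observed in that one output (assumed to be the same on every page)
labels <- c("Publicación en el DOF", "Monto de la Multa",
            "Plazo de inhabilitación", "Inicia", "Termina")

# Hypothetical helper: split one record into a named character vector
parse_ficha <- function(txt, labels) {
  pattern <- paste0("(", paste(labels, collapse = "|"), "):")
  # Labels in the order they occur in the text, without the trailing colon
  found <- sub(":$", "", regmatches(txt, gregexpr(pattern, txt))[[1]])
  # Text between labels; the first piece is whatever precedes the first label
  values <- trimws(strsplit(txt, pattern)[[1]][-1])
  stopifnot(length(found) == length(values))
  setNames(values, found)
}

fields <- parse_ficha(raw, labels)
print(fields)
```

Each provider then yields one named vector, and vectors that share the same labels can be stacked with `rbind` to form the rows of a table.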

【Comments】:

This helped a lot with a similar project of mine. Thank you for answering your own question.