How to extract this node content using rvest

【Question title】: How to extract this node content using rvest
【Posted】: 2021-06-02 13:24:14
【Question】:

I am scraping this website. I am particularly interested in extracting the content found in the last script node (script node snippet). So far I have tried the following:

library(rvest)
library(xml2)

url <- "https://insolvencyinsider.ca/filing/"
ii <- read_html(url)
fwp <- ii %>%
  html_nodes("body") %>%
  xml_find_first(xpath = "/script[15]") %>%
  html_text() # Not text so I wouldn't expect this to work.

#> character (empty)


fwp <- ii %>%
  html_nodes("body") %>%
  xml_find_first(xpath = "/script[15]") %>%
  html_attr("window.FWP_JSON") # Don't think this makes sense since it's not an attribute?

#> chr NA

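【Comments】:

Tags: r web-scraping rvest

【Solution 1】:

You can pull it out of the page source with a regular expression using the pattern below, then parse the captured string with jsonlite.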

library(rvest)
library(jsonlite)
library(stringr)

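# Read the page and flatten the whole document to a single string
text <- read_html('https://insolvencyinsider.ca/filing/') %>%
  toString()

# Capture group 1 (column 2 of the match matrix) holds the JSON assigned to window.FWP_JSON
data <- stringr::str_match(text, 'window\\.FWP_JSON = (.*?);\\n')[,2]

# Convert the JSON string into a nested R list
result <- jsonlite::parse_json(data)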
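
If you would rather pick out the script node itself than run the regex over the whole page, something like the following might work. This is only a sketch: it assumes the target script is the one whose text contains window.FWP_JSON, and the inline HTML with its JSON keys (total, per_page) is made up so the example runs offline; it is not the site's real data.

```r
library(xml2)
library(stringr)
library(jsonlite)

# Made-up stand-in for the real page (hypothetical content, for illustration only)
html <- '<html><body>
<script>var other = 1;</script>
<script>
window.FWP_JSON = {"total": 2, "per_page": 10};
window.FWP_HTTP = {"uri": "filing"};
</script>
</body></html>'

page <- read_html(html)

# Select the <script> element by its content rather than by its position
node <- xml_find_first(page, "//script[contains(., 'window.FWP_JSON')]")
js <- xml_text(node)

# Keep only the object literal assigned to window.FWP_JSON
json_text <- str_match(js, 'window\\.FWP_JSON = (.*?);\\n')[, 2]

result <- parse_json(json_text)
```

Selecting by content avoids depending on the script being the 15th one, which can change whenever the page template changes. Note also that an XPath starting with a single / is resolved from the document root, which is likely why "/script[15]" matched nothing in the question.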

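Regex: window\.FWP_JSON = matches the literal assignment, (.*?) lazily captures the JSON that follows, and ;\n ends the match at the first semicolon that is followed by a newline.

【Comments】: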
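
To see what the pattern captures, here is a small offline sketch. The JavaScript fragment is invented for illustration (the keys a, b and c are not from the real page); it includes a semicolon inside a string value to show why the pattern anchors on a semicolon followed by a newline rather than on the first semicolon.

```r
library(stringr)

# Invented fragment of inline JavaScript standing in for the page source
js <- 'window.FWP_JSON = {"a": 1, "b": "x;y"};\nwindow.FWP_HTTP = {"c": 2};\n'

pattern <- 'window\\.FWP_JSON = (.*?);\\n'

# str_match returns a matrix: column 1 is the full match, column 2 is group 1
m <- str_match(js, pattern)
captured <- m[, 2]

print(captured)
```

Because the quantifier is lazy, the capture stops at the first ";" that is followed by a newline, so the second assignment is not swallowed. This relies on the JSON sitting on a single line, since . does not match a newline by default.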