【Question title】: R web scraping plotly trace hover text without selenium or phantomjs
【Posted】: 2021-04-25 17:54:47
【Question】:

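I am trying to scrape the hover text content from some plotly traces published on the web. I have not done this kind of scraping before, and if possible I would like to do it in R without selenium or phantomjs... maybe using V8? Could someone point me in the right direction? The link to the plots is below. I am specifically looking for the data in Figure 21: Positivity rate for COVID-19 in Alberta by zone. Thanks!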

https://www.alberta.ca/stats/covid-19-alberta-statistics.htm

【Comments】:

Tags: javascript html r web-scraping rvest

【Solution 1】:

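Using rvest and jsonlite, the following code gives you the data you are looking for. The data behind the plotly charts is stored in <script> tags.

The first step is to identify the widget ID of the figure of interest. The code below finds it by searching for the figure's caption text. You can then use html_nodes() and html_attr() to locate the matching <script> node, and jsonlite::fromJSON() converts its JSON content into an R list object.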

library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(dplyr)


url <-
  "https://www.alberta.ca/stats/covid-19-alberta-statistics.htm#laboratory-testing"

raw_html <- read_html(url)

# get widget ID

caption <-
  "Figure 21: Positivity rate for COVID-19 in Alberta by zone."

figure_divs <- html_nodes(raw_html, ".figure")

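# fixed = TRUE matches the caption literally (the trailing "." is not a regex wildcard)
figure_21_div_lgl <- grepl(caption, as.character(figure_divs), fixed = TRUE)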

widget_id <-
  figure_divs[figure_21_div_lgl] %>%
  html_nodes("div") %>%
  html_attr("id")

# find data for the correct widget_id

data_for <-
  html_nodes(raw_html, "script") %>%
  html_attr("data-for")

data_for_figure_21_lgl <-
  !is.na(data_for) & data_for == widget_id

data_for_figure_21 <-
  html_nodes(raw_html, "script") %>%
  .[data_for_figure_21_lgl] %>%
  html_text()

dff21_l <- fromJSON(data_for_figure_21)
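
The lookup above can also be tried offline on a miniature page, which helps when checking the logic without hitting the live site. The following is only a sketch: the HTML string, the id "htmlwidget-demo", the "html-widget" class, and the tooltip values are invented for illustration and are assumed to mimic how htmlwidgets embeds a plotly payload. It also swaps the logical filtering for a CSS attribute selector on data-for.

```r
library(rvest)
library(jsonlite)

# Hypothetical miniature page; the id, class, and values are made up for this demo.
sample_html <- '<html><body>
<div class="figure">
  <div id="htmlwidget-demo" class="plotly html-widget"></div>
  <script type="application/json" data-for="htmlwidget-demo">{"x":{"data":[{"text":["Report Date: 2020-03-06<br>Percent: 9.68<br>Number of tests: 31"]}]}}</script>
  <p class="caption">Figure 21: Demo caption.</p>
</div>
</body></html>'

page <- read_html(sample_html)

# Locate the figure block whose text contains the caption (literal match).
figure_divs <- html_nodes(page, ".figure")
is_target <- grepl("Figure 21: Demo caption.", html_text(figure_divs), fixed = TRUE)

# The widget container carries the id that the <script> tag points to.
widget_id <- figure_divs[is_target] %>%
  html_nodes("div.html-widget") %>%
  html_attr("id")

# Select the matching <script> directly through its data-for attribute.
payload_json <- page %>%
  html_nodes(sprintf("script[data-for='%s']", widget_id)) %>%
  html_text()

# simplifyVector = FALSE keeps the nested list structure predictable.
payload <- fromJSON(payload_json, simplifyVector = FALSE)
hover_text <- payload$x$data[[1]]$text[[1]]
```

On the real page the class names and nesting may differ, so the selectors here should be checked against the actual markup before relying on them.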

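To extract the data shown in the tooltips (the "hover text"), we need to iterate over the individual elements. Each tooltip string is first parsed into a DOM structure with read_html(), and its text is then extracted with html_text(). After that we iterate over the elements several more times to split and clean the strings, so that the result can finally be converted into a data.frame.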

tooltip_text_raw <- unlist(dff21_l$x$data$text)
tooltip_text <- map(tooltip_text_raw, read_html)
tooltip_text <- map(tooltip_text, html_text) %>% unlist()

tooltip_text_split <- strsplit(tooltip_text, "\\:")

tooltip_text_split_almost_clean <-
  map(tooltip_text_split,
      ~ gsub("Report Date|Percent|Number of tests", "", .x))

tooltip_text_split_clean <-
  map(tooltip_text_split_almost_clean, ~ str_squish(.[. != ""]))

tests_df <-
  map_dfr(tooltip_text_split_clean,
          ~ data.frame(
            date = as.Date(.x[1]),
            percent = .x[2],
            tests = .x[3]
          ))

head(tests_df)
#>         date percent tests
#> 1 2020-03-06    9.68    31
#> 2 2020-03-07    0.00   142
#> 3 2020-03-08    0.00   213
#> 4 2020-03-09    2.51   239
#> 5 2020-03-10    3.90   282
#> 6 2020-03-11    1.05   572
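
In the data.frame built above, percent and tests are still character strings. As a possible alternative, the fields can be captured with a single regular expression and converted to typed columns in one step. This is a minimal sketch in base R; the two tooltip strings and their exact layout are assumptions inferred from the labels used in the cleaning code, not copied from the live page.

```r
# Hypothetical tooltip strings in the shape implied by the cleaning code above.
tooltips <- c(
  "Report Date: 2020-03-06<br>Percent: 9.68<br>Number of tests: 31",
  "Report Date: 2020-03-07<br>Percent: 0.00<br>Number of tests: 142"
)

# One capture group per field; lazy ".*?" skips whatever separates the fields.
pattern <- paste0(
  "Report Date:\\s*([0-9-]+).*?",
  "Percent:\\s*([0-9.]+).*?",
  "Number of tests:\\s*([0-9,]+)"
)

# regexec() returns the full match plus each capture group per string.
parts <- regmatches(tooltips, regexec(pattern, tooltips, perl = TRUE))
m <- do.call(rbind, parts)  # character matrix: full match, date, percent, tests

tests_typed <- data.frame(
  date = as.Date(m[, 2]),
  percent = as.numeric(m[, 3]),
  tests = as.integer(gsub(",", "", m[, 4]))  # drop thousands separators, if any
)
```

Strings that do not match the pattern yield an empty result from regmatches() and are silently dropped by rbind(), so it is worth comparing nrow(tests_typed) with length(tooltips) afterwards.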

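【Comments】:

This was very helpful, thank you! I had to change a couple of objects in the first half: asd = raw_html, and the name of the widget had changed a bit, but I was able to find it. I wonder whether the widgets get renamed periodically? Something I will have to watch out for. The data export is only available for the case data by zone, and I am interested in collecting their testing data. Thanks again.
I updated my answer to include code that finds the widget ID from the figure caption. If this answers your question, please consider accepting my answer as the correct one.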