Using R to scrape play-by-play data: answers

【Question title】: Using R to scrape play-by-play data
【Posted】: 2020-04-27 03:15:30
【Question】:

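I am currently trying to scrape the play-by-play entries from the following link: https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4

I used SelectorGadget to determine the CSS selector and ended up with '//td'. However, when I try to scrape the data with it, html_nodes() returns an empty list, so the following code returns an error.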

library("rvest")

url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"

play_by_play <- url %>% 
  read_html %>%  
  html_node(xpath='//td') %>% 
  html_table()
play_by_play

Does anyone know how to solve this?

Thanks in advance!

【Comments】:

Tags: r web-scraping css-selectors rvest

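【Solution 1】:

I think you cannot get the table simply because there is no table on the site (see the page source). If the page did contain any tables, you could get them with the following code.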

library("rvest")

url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"

play_by_play <- url %>% 
  read_html %>%  
  html_table() 
play_by_play
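
To see what html_table() returns when a page does contain a table, here is a minimal sketch that runs without network access. The HTML string and its values are made up for illustration and are not taken from basket.fi.

```r
library(rvest)

# Hypothetical static page with one table; the values are invented for illustration
html <- '<html><body><table>
  <tr><th>time</th><th>event</th></tr>
  <tr><td>00:10</td><td>jump ball</td></tr>
  <tr><td>00:25</td><td>2pt made</td></tr>
</table></body></html>'

# html_table() on a whole document returns a list with one element per <table>
tables <- read_html(html) %>% html_table()
length(tables)
tables[[1]]
```

On a page with no table element in its static source, the same call simply returns an empty list.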

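【Comments】:

【Solution 2】:

The data on the page you are loading is loaded with JavaScript, so read_html does not see the content you want. If you open "View source", you will not find any table or td in the page source.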
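
A minimal sketch of that situation, assuming a page whose static HTML is only an empty container filled in by a script at runtime. The HTML string is a hypothetical stand-in, not the real basket.fi source.

```r
library(rvest)

# Hypothetical stand-in for the static source of a JavaScript-driven page:
# the table rows do not exist until a browser runs the script
static_html <- '<html><body>
  <div id="app"></div>
  <script>/* rows are inserted into #app at runtime */</script>
</body></html>'

page <- read_html(static_html)

# read_html() does not execute JavaScript, so no td or table nodes are found
td_nodes <- html_nodes(page, "td")
tables <- html_table(page)

length(td_nodes)  # 0
length(tables)    # 0
```

Running the same two checks on the result of read_html(url) is a quick way to tell whether a page needs a browser-based tool.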

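What you can do is use another option such as RSelenium to get the rendered page source. If you still want to use rvest afterwards, you can scrape from the source you obtained.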

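library(rvest)
library(RSelenium)  # package name is case-sensitive: RSelenium, not Rselenium

url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"

# Start a Selenium server and a browser session
rD <- rsDriver()
remDr <- rD$client

# Load the page in a real browser so that its JavaScript runs
remDr$navigate(url)

# Parse the rendered page source with rvest and select the td nodes
play_by_play <- remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes("td")

# Close the browser, stop the server, and free resources
remDr$close()
rD$server$stop()
rm(remDr, rD)
gc()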
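
The td nodes collected above are still XML nodes. A rough sketch of turning them into text follows; the string stands in for remDr$getPageSource()[[1]], and both the cell values and the two-cells-per-row layout are assumptions, since the real page structure is not shown here.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; cell values are invented
rendered_source <- '<html><body><table>
  <tr><td>00:10</td><td>jump ball</td></tr>
  <tr><td>00:25</td><td>2pt made</td></tr>
</table></body></html>'

# Extract the text of every td node as a flat character vector
cells <- read_html(rendered_source) %>%
  html_nodes("td") %>%
  html_text(trim = TRUE)

# Reshape into rows, assuming two cells per row (adjust ncol to the real page)
play_matrix <- matrix(cells, ncol = 2, byrow = TRUE)
play_matrix
```

If the rendered source contains a proper table element, calling html_table() on the parsed source may be simpler than reshaping the cells by hand.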

【Comments】: