【Question Title】: Trouble mapping a function to a list of scraped links using rvest
【Posted】: 2023-09-30 02:36:01
【Question】:

I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, applying the `get_injury_data` function to the links, and I am having trouble getting it to run successfully. I receive the following error:

    Error in matrix(unlist(values), ncol = width, byrow = TRUE) : 
    'data' must be of a vector type, was 'NULL'

I would appreciate any help figuring out where I have gone wrong. The code is below:

library(tidyverse)
library(rvest)

# create a function to grab the team links

get_team_links <- function(url){
  url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_attr('href') %>%
  .[. != '#'] %>% # remove entries that are only the '#' string
  paste0('https://www.transfermarkt.com', .) %>% # prepend the site domain to the relative URLs
  unique() %>% # keep only unique links
  as_tibble() %>% # turn the strings into a tibble
  rename("links" = "value") %>%  # rename the value column
  filter(!grepl('profil', links)) %>% # remove player profile links
  filter(!grepl('spielplan', links)) %>%  # remove links to additional team pages
  mutate(links = gsub("startseite", "kader", links)) # point the link to the detailed squad page
}

# create a function to grab the player links
get_player_links <- function(url){
  url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_attr('href') %>%
  .[. != '#'] %>% # remove entries that are only the '#' string
  paste0('https://www.transfermarkt.com', .) %>% # prepend the site domain to the relative URLs
  unique() %>% # keep only unique links
  as_tibble() %>% # turn the strings into a tibble
  rename("links" = "value")  %>%  # rename the value column
  filter(grepl('profil', links)) %>% # keep only player profile links
  mutate(links = gsub("profil", "verletzungen", links)) # point the link to the injury page
}

# create a function to get the injury dataset
get_injury_data <- function(url){
  url %>% 
  read_html() %>%
  html_nodes('#yw1') %>%
  html_table()
}

# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')

# get player links by mapping the function onto the team links,
# then unnest the resulting list of lists into one long list
player_injury_links <- team_links %>% 
  mutate(links = map(team_links$links, get_player_links)) %>% 
  unnest(links)

# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
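
One way to pinpoint which pages fail (a sketch, assuming the `player_injury_links` tibble and `get_injury_data` function defined above) is to wrap the scraper in `purrr::safely()`, which records each error alongside its result instead of aborting the whole map:

```r
# safely() returns a function whose output is a list with $result and $error,
# so a single bad page no longer stops the loop
safe_get_injury_data <- purrr::safely(get_injury_data)

results <- purrr::map(player_injury_links$links, safe_get_injury_data)

# the URLs whose scrape raised an error (e.g. pages with no injury table)
failed_links <- player_injury_links$links[
  purrr::map_lgl(results, ~ !is.null(.x$error))
]
```

Inspecting `failed_links` in a browser usually makes the cause obvious, e.g. players with no injury history.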

【Comments】:

  • Have you identified which link causes this error?
  • @QHarr No, and I'm not sure how best to do that.
  • @QHarr Actually, some players have no injury data, so I'm not sure how best to handle those players when creating the html_table.
  • Wrap it in tryCatch?
  • @QHarr The error persists.

Tags: r web-scraping mapping rvest


【Solution 1】:

Solution

It turned out that some of the links I was scraping had no data behind them.

To get around this, I used the `possibly` function from the purrr package, which let me build a new, error-tolerant version of my scraping function.

The call that had been giving me trouble now works as follows:

player_injury_data <- player_injury_links$links %>%
  purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
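
With `otherwise = NULL`, players whose page has no injury table simply come back as `NULL`. A sketch of one way to drop those and stack the remaining tables into a single data frame (an assumption on my part, and the column types of the individual tables may need harmonising first):

```r
player_injury_tables <- player_injury_data %>%
  purrr::compact() %>%            # drop the NULL results from failed pages
  purrr::map(1) %>%               # html_table() returns a list; take its first table
  dplyr::bind_rows(.id = "source") # stack into one data frame, keeping the list index
```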

【Discussion】:
