Trying to scrape a table from multiple webpages with a for loop in R: answers

【Question title】: Trying to scrape a table from multiple webpages with a for loop in R
【Posted】: 2020-10-16 01:28:08
【Question description】:

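I am trying to scrape information for different MLB teams from multiple webpages. The pages I am trying to scrape are https://www.covers.com/sport/baseball/mlb/teams/main/miami-marlins/2019 and https://www.covers.com/sport/baseball/mlb/teams/main/cleveland-indians/2019. For both teams, I want to scrape the 12th table on the page and then join the results together as a single data frame. So far, my code looks like this: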

library(rvest)
#> Loading required package: xml2
library(magrittr)
teams= c("miami-marlins", "cleveland-indians")

tables <- list()
index <- 1
for(i in teams){
  url <- paste0("https://www.covers.com/sport/baseball/mlb/teams/main/",(i),"/2019")
  table <- url %>% 
    read_html() %>% 
    html_nodes("table")%>%
    .[[12]]%>%
    html_table()
  
  tables[index] <- table
  
  index <- index + 1
  
  
}
#> Warning in tables[index] <- table: number of items to replace is not a multiple
#> of replacement length

#> Warning in tables[index] <- table: number of items to replace is not a multiple
#> of replacement length
df <- do.call("rbind", tables)

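Created on 2020-10-15 by the reprex package (v0.3.0)

When I run the code, I get the warning messages shown above, and the result contains only the dates on which the two teams played. I mostly borrowed the code from the post "Trying to use rvest to loop a command to scrape tables from multiple pages" and then tried to adapt it to my needs, but evidently one of my changes broke it. Below is the code I wrote to scrape the table from a single page, which works fine.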

library(rvest)  # also provides the %>% pipe used below
url15 <- paste0("https://www.covers.com/sport/baseball/mlb/teams/main/miami-marlins/2019")
table <- url15 %>% 
  read_html() %>% 
  html_nodes("table")%>%
  .[[12]]%>%
  html_table()

Created on 2020-10-15 by the reprex package (v0.3.0)

If anyone could point out what I am doing wrong here, I would appreciate it, as I am quite new to this.

【Comments on the question】:

You probably want tables[[index]] <- table, because tables is a list rather than a vector.
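The difference between the two assignment forms can be checked offline. The following is a minimal sketch, assuming two small toy data frames (`df1` and `df2`, hypothetical stand-ins for the scraped tables, not real covers.com data). It suggests why the loop in the question ended up with only the first column of each table.

```r
# Toy data frames standing in for two scraped tables (hypothetical data).
df1 <- data.frame(date = c("Apr 1", "Apr 2"), score = c("W 4-2", "L 1-3"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(date = c("Apr 3", "Apr 4"), score = c("L 0-5", "W 6-1"),
                  stringsAsFactors = FALSE)

# Single brackets: one list slot is offered a two-column data frame, so R
# keeps only the first column and warns about the length mismatch.
wrong <- list()
suppressWarnings(wrong[1] <- df1)

# Double brackets: the whole data frame is stored as a single list element.
right <- list()
right[[1]] <- df1
right[[2]] <- df2

# Row-bind the stored data frames, as the question does after its loop.
combined <- do.call(rbind, right)

str(wrong[[1]])  # a bare column, not a data frame
dim(combined)    # rows and columns of the stacked result
```

With `[[`, each list element is a complete data frame, so the `do.call("rbind", tables)` step from the question has full tables to combine.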

Tags: r for-loop web-scraping

【Solution 1】:

Try this:

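library(rvest)
library(dplyr)
teams <- c("miami-marlins", "cleveland-indians")
# Build one URL per team, scrape the 12th table from each page,
# and stack the resulting data frames row-wise.
dplyr::bind_rows(lapply(
  paste0("https://www.covers.com/sport/baseball/mlb/teams/main/", teams, "/2019"), 
  # The last step uses the first row as column names and drops it from the data.
  . %>% read_html() %>% html_nodes("table") %>% .[[12]] %>% html_table() %>% {`names<-`(.[-1L, ], .[1L, , drop = TRUE])}
))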
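The final step of the pipeline, which turns the first row into column names, is compact and can be hard to read. Below is a minimal offline sketch of the same idea, assuming a toy data frame `raw` whose first row holds the header (hypothetical data). The helper name `promote_header` is introduced here for illustration only and is not part of rvest or dplyr.

```r
# Toy table shaped like one whose header ended up in the first data row
# (hypothetical data, not taken from covers.com).
raw <- data.frame(X1 = c("Date", "Apr 1", "Apr 2"),
                  X2 = c("Score", "W 4-2", "L 1-3"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: use the first row as column names, then drop that row.
promote_header <- function(tbl) {
  header <- as.character(unlist(tbl[1L, ]))  # values of the first row
  body <- tbl[-1L, , drop = FALSE]           # every row except the first
  names(body) <- header
  body
}

clean <- promote_header(raw)
names(clean)
nrow(clean)
```

A named helper like this could replace the braced expression inside the `lapply()` call if readability matters more than brevity.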

【Comments】: