Scraping data from tables on multiple web pages in R (football players): answers

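【Question title】: Scraping data from tables on multiple web pages in R (football players)
【Posted】: 2013-12-17 15:09:05
【Question description】:

I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.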

http://www.sports-reference.com/cfb/players/ryan-aplin-1.html

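I cannot find an aggregate of all players, so I need to go page by page and pull out the bottom row of each HTML table (passing, scoring, rushing and receiving, etc.).

Each player is categorized by last name, and the links for each letter of the alphabet are here.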

http://www.sports-reference.com/cfb/players/

For instance, every player whose last name starts with A is found here.

http://www.sports-reference.com/cfb/players/a-index.html

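This is my first time really getting into data scraping, so I tried to find similar questions with answers. The closest answer I found was this question

I believe I could use a very similar approach, swapping the page number for each collected player's name. However, I'm not sure how to change it to look for player names instead of page numbers.

Samuel L. Ventura also gave a talk recently about scraping NFL data, which can be found here.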

Edit:

Ben has been really helpful and provided some terrific code. The first part works great, but when I try to run the second part I run into this.

> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it! 
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) : 
  unexpected '<' occurred on line 1
 use a command like
 x <- edit()
 to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
  deparse may be incomplete

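This may be an error caused by not having enough memory (the computer I'm currently using only has 4 GB). However, I don't understand this error: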

    > all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") : 
 no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

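Looking at my other data sets, my players really only go back to 2007. If there were some way to pull only players from 2007 onwards, that might help shrink the data. If I had a list of the people whose data I want to pull, could I just replace the lnk in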

 links[[i]] <- paste0("http://www.sports-reference.com", lnk)

with only the players that I need?

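【Question discussion】:

You'd be better off using a dedicated web scraping tool. I tried doing something similar in R but eventually gave up and ended up using Scrapy to dump the data into CSV, which I then analysed in R. Scrapy is written in Python, so it may not be an option for you. There are similar frameworks in other languages: iRobot Visual Scraping, various Rubygems, etc.
The errors you are getting may be due to glitches in your internet connection or in the sports website's server. I've updated my answer to handle errors: it skips any URL that gives an error and keeps going. I haven't run it to completion yet, but it has been going fine for the last few hours. If you run into more problems, you should accept the answer you've got here and post a new question to go over them. In the code I've posted, subsetting to 2007 onwards is only possible once you've got all the tables. There may be other methods, though.
If you have a list of players, that will save a lot of time, because we can subset the list of URLs before scraping them all. That might be the way forward for improving the method. Or try Python's Scrapy as @aseidlitz suggests; there is some expertise here for that too. I've also used it with success, but at the moment I mostly work in R.
I have now done a full run of this code (it took about 5 hours on a machine with 6 Gb RAM and never went over 40% usage) and it seems to work just fine. The RData file is here: fileswap.com/dl/tNJYJ9yrN (9 Mb)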

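Tags: html xml r web-scraping rcurl

【Solution 1】:

You can easily get all the data from all the tables on all the player pages like this...

First, make a list of the URLs for all the player pages...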

require(RCurl); require(XML)
n <- length(letters) 
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 11 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them from 
  # anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
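
If you already have a list of the players you care about, one option is to subset the URL vector before scraping anything, which is what the question asks about. The sketch below assumes your player identifiers match the slug in each URL (for example `ryan-aplin` in `.../players/ryan-aplin-1.html`). The names `keep_wanted_links`, `wanted_players` and `example_links` are hypothetical, and the short `example_links` vector only stands in for the full `links` vector.

```r
# Hypothetical helper: keep only the URLs whose player slug is in `wanted`.
# Assumes URLs look like ".../cfb/players/<slug>-<number>.html".
keep_wanted_links <- function(links, wanted) {
  # strip the directory and the trailing "-<number>.html" to get the slug
  slugs <- sub("-[0-9]+\\.html$", "", basename(links))
  links[slugs %in% wanted]
}

# Stand-in for the full vector of player URLs built above
example_links <- c(
  "http://www.sports-reference.com/cfb/players/ryan-aplin-1.html",
  "http://www.sports-reference.com/cfb/players/neli-aasa-1.html",
  "http://www.sports-reference.com/cfb/players/james-aaron-1.html"
)

# Hypothetical list of players of interest, written as URL slugs
wanted_players <- c("ryan-aplin", "neli-aasa")

links_subset <- keep_wanted_links(example_links, wanted_players)
print(links_subset)
```

Scraping only `links_subset` instead of all the URLs should cut both the run time and the memory needed, at the cost of having to turn your player names into slugs first.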

Now we have a vector of about 67,000 URLs (that seems like a lot of players, can that be right?), so:

Second, scrape all the tables at each URL to get their data, like so:

# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in seq_along(links)){
  print(i) # keep track of progress
  # error handling - if a URL gives an error, leave that element empty
  # and skip to the next URL
  result <- try(readHTMLTable(links[i], stringsAsFactors = FALSE))
  if(inherits(result, "try-error")) next
  all_tables[[i]] <- result
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
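
One thing to watch in the naming step: the `gsub()` pattern only strips a `-1.html` suffix, and `unique()` makes the name vector shorter than the list of tables if any URL appears more than once, so names and tables could fall out of step. A more defensive sketch is to drop repeated URLs before scraping and to use the file name without `.html` as the name. The `-2` page below is made up for illustration (the thread only shows `-1` pages), and all `example_*` names are hypothetical stand-ins.

```r
# Stand-in for the scraped URL vector; note the repeated URL and the
# made-up "-2" page
example_links <- c(
  "http://www.sports-reference.com/cfb/players/jeff-aaron-1.html",
  "http://www.sports-reference.com/cfb/players/jeff-aaron-1.html",
  "http://www.sports-reference.com/cfb/players/jeff-aaron-2.html"
)

# Drop repeated URLs first, so that names and tables stay the same length
example_links <- unique(example_links)

# Use the file name without ".html" as the name, e.g. "jeff-aaron-1"
example_names <- sub("\\.html$", "", basename(example_links))

# Stand-in for the list of scraped tables, one element per URL
example_tables <- vector("list", length(example_links))
names(example_tables) <- example_names
print(names(example_tables))
```

Keeping the trailing number in the name means two players who share a name would still get distinct list entries.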

The result looks like this (this is just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
   Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0

$`neli-aasa`$kick_ret
   Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0

$`neli-aasa`$receiving
   Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

We end up with a data frame that is ready for further analysis (again, just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
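
Since the question only needs players from 2007 onwards, the combined data frame can be filtered on `Year` once it exists. The sketch below assumes `Year` is a character column that sometimes carries a leading `*`, as in the defense output shown earlier. `example_passing`, `YearNum` and `recent` are hypothetical names, and the rows are illustrative values rather than scraped data.

```r
# Stand-in for the `passing` data frame; Year is character, sometimes with "*"
example_passing <- data.frame(
  Year = c("1978", "2000", "*2007", "*2010"),
  School = c("Air Force", "Alabama-Birmingham", "Utah", "Utah"),
  stringsAsFactors = FALSE
)

# Remove any non-digit characters, then convert to a number
example_passing$YearNum <- as.integer(gsub("[^0-9]", "", example_passing$Year))

# Keep seasons from 2007 onwards; which() drops rows whose year is NA
recent <- example_passing[which(example_passing$YearNum >= 2007), ]
print(recent)
```

This only trims the data after everything has been downloaded, so it helps with the later analysis but not with the scraping time or memory use.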

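【Discussion】:

This was really helpful!! However, I ran into some issues with the second part and have edited my original post.
@user2269255 I've updated the code to handle errors without stopping the scraping.
This code could not run on the computer I originally had because of memory limitations. I spoiled myself and upgraded, and can now say this works like a charm!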