【Posted】: 2016-09-08 08:24:45
【Question】:
I am trying to scrape the page https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads and can extract the text data fine using rvest:
library(plyr)
library(XML)
library(rvest)
library(dplyr)
library(magrittr)
library(data.table)
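# Download and parse the page once, rather than on every iteration
html <- read_html("https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads")
tables <- html_nodes(html, "table")
for (i in 1:16) {
  # Build the variable name: squad1, squad2, ..., squad16
  squad_name <- paste0("squad", i)
  print(squad_name)
  # Parse the i-th table into a data frame and bind it to that name
  assign(squad_name, html_table(tables[[i]]))
}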
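But I would also like to add an extra column containing the URL of the club from each table. For example, squad 1 (the Poland squad on the page, only the first few players shown):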
   0#0  Pos.  Player               Date of birth (age)                     Caps  Goals  Club
1  1    1GK   Wojciech Szczęsny    (1990-04-18)18 April 1990 (aged 22)     11    0      Arsenal
2  2    2DF   Sebastian Boenisch   (1987-02-01)1 February 1987 (aged 25)   9     0      Werder Bremen
3  3    2DF   Grzegorz Wojtkowiak  (1984-01-26)26 January 1984 (aged 28)   19    0      Lech Poznań
4  4    2DF   Marcin Kamiński      (1992-01-15)15 January 1992 (aged 20)   3     0      Lech Poznań
5  5    3MF   Dariusz Dudka        (1983-12-09)9 December 1983 (aged 28)   65    2      Auxerre
6  6    3MF   Adam Matuszczyk      (1989-02-14)14 February 1989 (aged 23)  20    1      Fortuna Düsseldorf
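I would like a "clubURL" column after "Club" that holds the Wikipedia URL of that club. For example, the first player plays for Arsenal, so I want to take the link for Arsenal from the table and create: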
   0#0  Pos.  Player             Date of birth (age)                  Caps  Goals  Club
1  1    1GK   Wojciech Szczęsny  (1990-04-18)18 April 1990 (aged 22)  11    0      Arsenal
   clubURL
1  https://en.wikipedia.org/wiki/Arsenal_F.C.
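And so on for the other players. I found "rvest table scraping including links", but I could not get that example to work, nor adapt it to what I want to do. Apologies if this has been asked before.
Thanks,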
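One possible way to get such a column, sketched under assumptions rather than tested against the live page: read each table with html_table(), then pull the href of the club link from the last cell of every data row and append it. The helper name add_club_url is hypothetical, and the sketch assumes that the club sits in the last <td> of each row and that the club link is the last <a> in that cell (a flag link may come first). A small inline HTML table, with made-up markup and illustrative hrefs, stands in for the Wikipedia page so the snippet runs without network access.

```r
library(rvest)

# Hypothetical helper: parse one <table> node and append a clubURL column.
# Assumption: the club is in the last <td> of each data row, and the club
# link is the last <a> in that cell (a flag link may precede it).
add_club_url <- function(table_node, base_url) {
  squad <- html_table(table_node)
  # Data rows only: rows that contain at least one <td> (skips the header row)
  rows <- html_nodes(table_node, xpath = ".//tr[td]")
  stopifnot(nrow(squad) == length(rows))
  # html_node() yields one result per row; rows without a link give NA
  link <- html_node(rows, xpath = "(./td[last()]//a)[last()]")
  href <- html_attr(link, "href")
  # Turn relative hrefs such as /wiki/... into absolute URLs
  url <- rep(NA_character_, length(href))
  ok <- !is.na(href)
  if (any(ok)) url[ok] <- xml2::url_absolute(href[ok], base_url)
  squad$clubURL <- url
  squad
}

base_url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"

# Inline stand-in for the real page (structure and hrefs are illustrative)
page <- read_html('<table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td>Wojciech Szcz\u0119sny</td>
      <td><a href="/wiki/England">ENG</a> <a href="/wiki/Arsenal_F.C.">Arsenal</a></td></tr>
  <tr><td>Dariusz Dudka</td>
      <td><a href="/wiki/AJ_Auxerre">Auxerre</a></td></tr>
</table>')

# One data frame per table, each with the extra clubURL column
squads <- lapply(html_nodes(page, "table"), add_club_url, base_url = base_url)
print(squads[[1]]$clubURL)
```

For the real page, page would presumably become read_html(base_url) and the list could be limited to the first 16 tables, as in the loop above. The actual Wikipedia markup may differ from this stand-in, so the XPath is worth checking against one table first.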
【Comments】:
Tags: r href screen-scraping rvest