【Question title】: Scraping from transfermarkt with R package rvest
【Posted】: 2019-01-17 23:24:51
【Question】:

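I am learning to scrape data and am working with transfermarkt, but today I ran into two problems. I used SelectorGadget to find the CSS selectors. My code is this: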

library(rvest)


url <- "https://www.transfermarkt.es/fc-granada/startseite/verein/16795"
webpage <- read_html(url)

players_html  <- html_nodes(webpage,"#yw1 .tooltipstered") 
players <- html_text(players_html) 
players

valores_html <- html_nodes(webpage,'.rechts.hauptlink')
valores <- html_text(valores_html)
valores
valores <- gsub(" miles €","000", valores)
valores <- gsub(" mill. €","0000", valores)
valores
valores <- gsub(",","",valores)
valores <- gsub(" ","", valores)
valores

The first problem comes up when selecting the players. This is the output:

> players_html  <- html_nodes(webpage,"#yw1 .tooltipstered")
> players <- html_text(players_html)
> players
character(0)

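I think the problem is the CSS selector, but it is the one SelectorGadget showed me when I selected the players, so I don't know how to fix it.

The other problem is with selecting their market values. gsub does not remove the trailing whitespace, so the strings cannot be converted to numbers. This is the output: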

> valores_html <- html_nodes(webpage,'.rechts.hauptlink')
> valores <- html_text(valores_html)
> valores
 [1] "700 miles €  "  "300 miles €  "  "800 miles €  "  "500 miles €  "  "300 miles €  "
 [6] "300 miles €  "  "1,00 mill. €  " "300 miles €  "  "1,20 mill. €  " "500 miles €  "
[11] "1,70 mill. €  " "1,50 mill. €  " "1,00 mill. €  " "800 miles €  "  "800 miles €  "
[16] "300 miles €  "  "2,00 mill. €  " "800 miles €  "  "700 miles €  "  "400 miles €  "
[21] "700 miles €  "  "1,00 mill. €  " "800 miles €  "
> valores <- gsub(" miles €","000", valores)
> valores <- gsub(" mill. €","0000", valores)
> valores
 [1] "700000  "   "300000  "   "800000  "   "500000  "   "300000  "   "300000  "   "1,000000  "
 [8] "300000  "   "1,200000  " "500000  "   "1,700000  " "1,500000  " "1,000000  " "800000  "
[15] "800000  "   "300000  "   "2,000000  " "800000  "   "700000  "   "400000  "   "700000  "
[22] "1,000000  " "800000  "
> valores <- gsub(",","",valores)
> valores <- gsub(" ","", valores)
> valores
 [1] "700000  "  "300000  "  "800000  "  "500000  "  "300000  "  "300000  "  "1000000  " "300000  "
 [9] "1200000  " "500000  "  "1700000  " "1500000  " "1000000  " "800000  "  "800000  "  "300000  "
[17] "2000000  " "800000  "  "700000  "  "400000  "  "700000  "  "1000000  " "800000  "

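Basically, the last gsub, which is meant to remove the trailing whitespace, has no effect here. Can someone help me with these two problems?

PS: I am using the Spanish version of transfermarkt.

【Discussion】:

    Tags: r regex web-scraping gsub rvest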


    【Solution 1】:

    As for gsub, we can use

    valores <- html_text(valores_html)
    valores <- gsub(" miles €", "000", valores)
    valores <- gsub(" mill. €", "0000", valores)
    valores <- gsub("\\D", "", valores)
    valores
    #  [1] "700000"  "300000"  "800000"  "500000"  "300000"  "300000"  "1000000" "300000"  "1200000"
    # [10] "500000"  "1700000" "1500000" "1000000" "800000"  "800000"  "300000"  "2000000" "800000" 
    # [19] "700000"  "400000"  "700000"  "1000000" "800000" 
    

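    where \\D matches any character that is not a digit, so the comma and the trailing whitespace are removed in a single step.

    For the player names we can write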
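    One plausible reason the original gsub(" ", "", valores) did nothing is that the trailing characters are non-breaking spaces (U+00A0) rather than ordinary spaces. That is an assumption, not something the output above confirms. The sketch below uses two made-up sample strings built on that assumption to show the difference and to finish the conversion to numbers:

```r
# Hypothetical sample values. "\u00a0" is a non-breaking space; that the page
# puts these after the euro sign ("\u20ac") is an assumption.
valores <- c("700 miles \u20ac\u00a0\u00a0", "1,20 mill. \u20ac\u00a0\u00a0")

# fixed = TRUE treats the pattern literally, so "." is not a regex wildcard.
valores <- gsub(" miles \u20ac", "000", valores, fixed = TRUE)
valores <- gsub(" mill. \u20ac", "0000", valores, fixed = TRUE)

# Replacing an ordinary space leaves the non-breaking spaces in place.
with_space <- gsub(" ", "", valores, fixed = TRUE)
nchar(with_space)

# "\\D" removes every non-digit: the comma and the non-breaking spaces alike.
digits <- gsub("\\D", "", valores)

# The strings now contain digits only, so the conversion is safe.
valores_num <- as.numeric(digits)
valores_num
```

    If the real strings contain some other kind of whitespace, "\\D" still works, because it does not depend on which non-digit character is present.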

    players_html  <- html_nodes(webpage,"#yw1 span.hide-for-small a.spielprofil_tooltip")
    players <- html_text(players_html) 
    players
    #  [1] "Rui Silva"             "Aarón Escandell"       "Bernardo Cruz"        
    #  [4] "José Antonio Martínez" "Germán Sánchez"        "Pablo Vázquez"        
    #  [7] "Álex Martínez"         "Adrián Castellano"     "Víctor Díaz"          
    # [10] "Quini"                 "Nicolás Aguirre"       "Fede San Emeterio"    
    # [13] "Ángel Montoro"         "Fran Rico"             "Alberto Martín"       
    # [16] "José Antonio González" "Alejandro Pozo"        "Antonio Puertas"      
    # [19] "Fede Vico"             "Daniel Ojeda"          "Álvaro Vadillo"       
    # [22] "Adrián Ramos"          "Rodri"          
    

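    This way we also get only one set of (full) names. Using "#yw1 a.spielprofil_tooltip" instead, for example, would also return their short versions.

    【Comments】:

    • Thanks. Why didn't SelectorGadget show the correct CSS selector? I suppose I have to look through the page's source code, but that is not efficient. How did you get the correct selector?
    • @MiguelAnguita, I am not an expert on this either. I just tried SelectorGadget and also could not figure out how to get a good CSS selector from it. For the answer, I used Safari Web Inspector to point me to the right place in the source code and then looked around there myself. Sorry not to have a better answer, but a good tool does seem to help!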
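    The difference between the selectors can be tried offline on a small HTML string. The markup below is hypothetical and heavily simplified (the show-for-small class in particular is invented for illustration), and it assumes that tooltipstered is a class added by JavaScript in the browser, which would explain why it is missing from the HTML that read_html downloads:

```r
library(rvest)

# Hypothetical, simplified markup modeled on the selectors in the answer;
# the real page is larger and may be structured differently.
html <- '<html><body><div id="yw1">
  <span class="hide-for-small"><a class="spielprofil_tooltip" href="#">Rui Silva</a></span>
  <span class="show-for-small"><a class="spielprofil_tooltip" href="#">R. Silva</a></span>
</div></body></html>'
page <- read_html(html)

# A class that only exists after scripts run is absent here: no matches.
js_only <- html_text(html_nodes(page, "#yw1 .tooltipstered"))

# Broad selector: matches both the full and the short name.
both <- html_text(html_nodes(page, "#yw1 a.spielprofil_tooltip"))

# Narrow selector: only the link inside span.hide-for-small.
full <- html_text(html_nodes(page, "#yw1 span.hide-for-small a.spielprofil_tooltip"))

length(js_only)
both
full
```

    A browser inspector shows the page after scripts have run, so a class seen there is worth checking against the raw source before using it in html_nodes.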