【Question title】: Using rvest to scrape match scores from crickbuzz in R
【Posted】: 2016-01-09 05:08:31
【Question】:

I am scraping the Crickbuzz scores page to get match details, using SelectorGadget to find the CSS selectors. This is what I have so far:

library(rvest)

crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
matches_dates <- crickbuzz %>%
  html_nodes(".schedule-date:nth-child(1)") %>%
  html_text()

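I have already extracted the matches, scores, and venues, but I am having trouble getting the dates. The code above gives me the following result: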

> matches_dates
     "   -     " "   -     " "   "       "   "       "   "       "   "   "  "      
    "   "       "   "       "   "       "   -     " "   -     " "   -     "

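So 21 elements are returned, meaning there are currently 21 matches, but none of them contains any text.

I then looked at what html_nodes() returns, and it looks like this: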

{xml_nodeset (21)}
1 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">    
  </span>
2 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">    
  </span>
3 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">    
  </span>
... and so on

In other words, the tags contain no text for me to extract. How can I get the dates?

【Comments】:

    Tags: r rvest


    【Solution 1】:

    The date is not stored as text inside the tag, so you need to extract it from the timestamp attribute:

    library(rvest)
    crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
    matches_dates <- crickbuzz %>%
        html_nodes(".schedule-date:nth-child(1)") %>%
        html_attr("timestamp")
    
    matches_dates
     [1] "1452268800000" "1452132000000" "1452247200000" "1452242400000" "1452327000000" "1452290400000" "1452310200000" "1452310200000" "1452310200000"
    [10] "1452310200000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452150000000" "1452153600000" "1452153600000"
    
    # These values are Unix timestamps in milliseconds. To convert them to a
    # date-time format, divide by 1000 and follow the answers to this question:
    # http://stackoverflow.com/questions/13456241/convert-unix-epoch-to-date-object-in-r
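
The conversion mentioned above can be sketched in base R as follows. This is a minimal illustration, not part of the original answer: the three timestamps are copied from the output shown earlier, the variable names `secs` and `match_times` are made up for the example, and displaying the result in UTC is an assumption (the site may intend a different time zone).

```r
# Timestamps copied from the html_attr() output above.
# They are milliseconds since the Unix epoch, stored as character strings.
matches_dates <- c("1452268800000", "1452132000000", "1452247200000")

# Use as.numeric(), not as.integer(): the values exceed R's 32-bit integer range.
# Dividing by 1000 converts milliseconds to seconds.
secs <- as.numeric(matches_dates) / 1000

# tz = "UTC" is an assumption; pass the time zone you want to display instead.
match_times <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(match_times, "%Y-%m-%d %H:%M")
# [1] "2016-01-08 16:00" "2016-01-07 02:00" "2016-01-08 10:00"
```

To get something close to the page's own `MMM dd` display, `format(match_times, "%b %d")` should work, although the month abbreviation it prints depends on the current locale.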
    

    【Discussion】:
