【Question title】: In R, extract a declared variable from HTML
【Posted】: 2018-02-17 10:34:11
【Question description】:

Is there a way to use rvest (or any other package) to extract variable declarations from a website, for example:

var global_tmp_status   =   0;

var global_goal_scored_overtime = [
      ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'],
      ['14/8/2016', 1,  0,  2,  0,  3]]; </script>

I would like to extract the data in global_goal_scored_overtime as a table.

Thanks

【Comments】:

    Tags: r web-scraping rvest


    【Solution 1】:

    You can evaluate the script with the excellent V8 package:

    library(rvest)  # HTML parsing and node selection
    library(V8)     # embedded JavaScript engine
    txt <- "<!DOCTYPE html>
    <html>
    <body>
    
    <script>
    var global_tmp_status = 0;
    var global_goal_scored_overtime = [ ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'], ['14/8/2016', 1, 0, 2, 0, 3]];
    </script> 
    
    </body>
    </html>"
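    # On a real page you will probably need a more specific selector to find the right <script> node
    script <- read_html(txt) %>% html_node("script") %>% html_text(trim=TRUE)
    # Create a JavaScript context and run the script in it
    ctx <- v8()
    ctx$eval(script)
    # Read the declared variables back into R
    ctx$get("global_tmp_status")
    ctx$get("global_goal_scored_overtime")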
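
On a real page there are usually several script tags, so `html_node("script")` may return the wrong one. A minimal sketch of one way to narrow it down, assuming the variable name appears in exactly one script; the `html` string below is a made-up page used only for illustration:

```r
library(rvest)

# Hypothetical page with several <script> tags; only one declares the variable
html <- "<html><body>
<script>var unrelated = 1;</script>
<script>var global_goal_scored_overtime = [['x', 'Overall'], ['14/8/2016', 3]];</script>
</body></html>"

# Collect the text of every <script> node
scripts <- read_html(html) %>% html_nodes("script") %>% html_text(trim = TRUE)

# Keep only the script(s) that mention the variable name
target <- scripts[grepl("global_goal_scored_overtime", scripts, fixed = TRUE)]
length(target)
```

The selected `target` string can then be passed to `ctx$eval()` in place of `script`.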
    

    This results in:

    > ctx$get("global_tmp_status")
    [1] 0
    

    > ctx$get("global_goal_scored_overtime")
         [,1]        [,2]     [,3]        [,4]         [,5]    [,6]     
    [1,] "x"         "Headed" "Left foot" "Right foot" "Other" "Overall"
    [2,] "14/8/2016" "1"      "0"         "2"          "0"     "3"  
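
Because the JavaScript array mixes strings and numbers, the result arrives as a character matrix with the header in its first row. To get the table asked for in the question, one possible follow-up in base R is sketched below. The matrix `m` is rebuilt by hand here so the snippet runs on its own, and the name `goals` and the assumption that dates are in day/month/year order are illustrative choices, not part of the original answer:

```r
# Character matrix as returned by ctx$get("global_goal_scored_overtime")
m <- matrix(
  c("x", "Headed", "Left foot", "Right foot", "Other", "Overall",
    "14/8/2016", "1", "0", "2", "0", "3"),
  nrow = 2, byrow = TRUE
)

# The first row holds the column names; the remaining rows hold the data
goals <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(goals) <- m[1, ]

# Convert the date column and the count columns to proper types
# (assumes dates are written as day/month/year)
goals$x <- as.Date(goals$x, format = "%d/%m/%Y")
goals[-1] <- lapply(goals[-1], as.integer)

str(goals)
```

Using `drop = FALSE` keeps the data as a matrix even when there is only one data row, so the same code works for longer arrays.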
    

    【Discussion】: