【Question title】: Scraping Javascript-Rendered Content in R from a Webpage without Unique URL
【Posted】: 2020-04-13 11:37:30
【Question】:

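I would like to scrape historical results of the South African LOTTO draw (especially total pool size, total sales, and so on) from the South African National Lottery website. By default one sees links to the results of the ten most recent draws, or one can select a date range to pull up a larger set of draw links (still displayed only ten per page).

Hovering over a link in the browser, e.g. 'LOTTO DRAW 2012', we see javascript:void();, so it is clear that the draw results are rendered using Javascript. After reading the advice on the R Web Scraping Cheat Sheet, I realised that I needed to open Google Chrome Developer Tools, open the Network tab, and then click the link to 'LOTTO DRAW 2012'. When I do that, I can see this url being called by the initiator.

When I right-click the initiator and select 'Copy response', I can see the data I need in the 'drawDetails' object of what appears to be JSON: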

{"code":200,"message":"OK","data":{"drawDetails":{"drawNumber":"2012","drawDate":"2020\/04\/11","nextDrawDate":"2020\/04\/15","ball1":"48","ball2":"6","ball3":"43","ball4":"41","ball5":"25","ball6":"45","bonusBall":"38","div1Winners":"1","div1Payout":"10546013.8","div2Winners":"0","div2Payout":"0","div3Winners":"28","div3Payout":"7676.4","div4Winners":"62","div4Payout":"2751.4","div5Winners":"1389","div5Payout":"206.3","div6Winners":"1872","div6Payout":"133","div7Winners":"28003","div7Payout":"50","div8Winners":"20651","div8Payout":"20","rolloverAmount":"0","rolloverNumber":"0","totalPrizePool":"13280236.5","totalSales":"11610950","estimatedJackpot":"2000000","guaranteedJackpot":"0","drawMachine":"RNG2","ballSet":"RNG","status":"published","winners":52006,"millionairs":1,"gpwinners":"52006","wcwinners":"0","ncwinners":"0","ecwinners":"0","mpwinners":"0","lpwinners":"0","fswinners":"0","kznwinners":"0","nwwinners":"0"},"totalWinnerRecord":{"lottoMillionairs":28716702,"lottoWinners":337285646,"ithubaMillionairs":135763,"ithubaWinners":305615802}},"videoData":[{"id":"1049","listid":"1","parentid":"1","videosource":"youtube","videoid":"chHfFxVi9QI","imageurl":"","title":"LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)","description":"","custom_imageurl":"","custom_title":"","custom_description":"","specialparams":"","lastupdate":"0000-00-00 00:00:00","allowupdates":"1","status":"0","isvideo":"1","link":"https:\/\/www.youtube.com\/watch?v=chHfFxVi9QI","ordering":"10001","publisheddate":"2020-04-11 
20:06:17","duration":"182","rating_average":"0","rating_max":"0","rating_min":"0","rating_numRaters":"0","statistics_favoriteCount":"0","statistics_viewCount":"329","keywords":"","startsecond":"0","endsecond":"0","likes":"6","dislikes":"0","commentcount":"0","channel_username":"","channel_title":"","channel_subscribers":"9880","channel_subscribed":"0","channel_location":"","channel_commentcount":"0","channel_viewcount":"0","channel_videocount":"1061","channel_description":"","channel_totaluploadviews":"0","alias":"lotto-lotto-plus-1-and-lotto-plus-2-draw-2012-11-april-2020","rawdata":"","datalink":"https:\/\/www.googleapis.com\/youtube\/v3\/videos?id=chHfFxVi9QI&part=id,snippet,contentDetails,statistics&key=AIzaSyC1Xvk2GUdb_N3UiFtjsgZ-uMviJ_8MFZI"}]}

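This is a POST request, so I tried to follow this answer, but I could not find the onclick value that indicates the data submitted with the form. Furthermore, the request URL for 'LOTTO DRAW 2012' is identical to the one for 'LOTTO DRAW 2011', so no unique identifier for the specific draw is passed in the URL itself. It is therefore unclear to me how the request for a specific draw's results is made unique.

So the smaller question is: given a particular LOTTO draw number or draw date, how does one find the unique identifier used to make the POST request for the data pertaining to that draw?

The larger question is: once the unique identifiers for all historical draws are known, how does one generate the JSON drawDetails object for each historical draw in turn and otherwise complete the scraping operation?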

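【Comments】:

  • In that side panel, click on the specific request you are interested in. Then click Headers and scroll down. See whether there is something like Query Form.
  • There is Form Data, with values for gameName and drawNumber; together these uniquely identify the draw. Thanks, so that answers the first question. The remaining question is how to run that request in R for a given drawNumber value to produce the JSON drawDetails object.

Tags: javascript r web-scraping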


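【Solution 1】:

You are right: the content on the page is updated by javascript via an ajax request. The server returns a json string in response to an http POST request. With a POST request, the server's response depends not only on the url you request but also on the body of the message you send to the server. In this case the body is a simple form with 3 fields: gameName, which is always LOTTO; isAjax, which is always true; and drawNumber, which is the field you want to vary.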

If you use httr, you specify these fields as a named list in the body argument of the POST function.
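As a minimal sketch of a single request, assuming the server accepts a URL-encoded form (which the Form Data panel in the browser suggests), the body could be built and sent as follows. The names lotto_url, draw_form and fetch_draw_json are hypothetical helpers introduced here for illustration only.

```r
# Endpoint used in this thread; the query string uses plain "&" separators.
lotto_url <- paste0("https://www.nationallottery.co.za/index.php",
                    "?task=results.redirectPageURL&Itemid=265",
                    "&option=com_weaver&controller=lotto-history")

# Build the three form fields for one draw (hypothetical helper).
draw_form <- function(draw_number) {
  list(gameName = "LOTTO", isAjax = "true",
       drawNumber = as.character(draw_number))
}

# Send the POST request and return the raw JSON text (hypothetical helper).
# encode = "form" is an assumption: it sends the list as a URL-encoded form.
fetch_draw_json <- function(draw_number) {
  res <- httr::POST(lotto_url, body = draw_form(draw_number), encode = "form")
  httr::stop_for_status(res)  # stop with an error on HTTP 4xx/5xx
  httr::content(res, as = "text", encoding = "UTF-8")
}

# Not run here because it needs network access:
# json_text <- fetch_draw_json(2012)
```

Keeping the form builder separate from the request makes it easy to inspect what will be sent before any network call is made.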

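Once you have the response for each draw, you will want to parse the json into an R-friendly format such as a list or data frame, using a library like jsonlite. Looking at the structure of this particular json, it makes most sense to extract the component $data$drawDetails and make it a one-row data frame. This allows you to bind several draws together into a single data frame.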
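The parsing step can be tried offline on a trimmed-down copy of the response shown in the question. This is only a sketch: sample_json keeps just a handful of the drawDetails fields, and parse_draw is a hypothetical helper name.

```r
library(jsonlite)

# Shortened response in the same shape as the one in the question
# (only a few drawDetails fields are kept for brevity).
sample_json <- paste0(
  '{"code":200,"message":"OK","data":{"drawDetails":{',
  '"drawNumber":"2012","drawDate":"2020\\/04\\/11","ball1":"48",',
  '"totalPrizePool":"13280236.5","totalSales":"11610950","winners":52006}}}'
)

# Extract $data$drawDetails and turn it into a one-row data frame.
parse_draw <- function(json_text) {
  details <- fromJSON(json_text)$data$drawDetails
  as.data.frame(details, stringsAsFactors = FALSE)
}

one_draw <- parse_draw(sample_json)

# One-row data frames with identical columns stack cleanly with rbind().
two_draws <- rbind(one_draw, one_draw)
```

Because every draw yields the same set of fields, each one-row data frame has the same columns, which is what makes rbind() safe here.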

Here is a function that does all of this for you:

lotto_details <- function(draw_numbers)
{
  # Fetch each draw in turn and stack the one-row data frames.
  do.call("rbind", lapply(draw_numbers, function(x)
  {
    # The URL is the same for every draw; the form body selects the draw.
    res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
                             "?task=results.redirectPageURL&",
                             "Itemid=265&option=com_weaver&",
                             "controller=lotto-history"),
                      body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
    # Parse the JSON text and keep only the drawDetails component.
    as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
  }))
}

You use it like this:

lotto_details(2009:2012)
#>   drawNumber   drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1       2009 2020/04/01   2020/04/04    51    15     7    32    42    45
#> 2       2010 2020/04/04   2020/04/08    43     4    21    24    10     3
#> 3       2011 2020/04/08   2020/04/11    42    43     8    18     2    29
#> 4       2012 2020/04/11   2020/04/15    48     6    43    41    25    45
#>   bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1         1           0          0           0          0          21
#> 2        22           0          0           0          0          31
#> 3        34           0          0           0          0          21
#> 4        38           1 10546013.8           0          0          28
#>   div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1     8455.3          60     2348.7        1252        189        1786
#> 2     6004.3          71     2080.6        1808      137.3        2352
#> 3     8584.5          60     2384.6        1405      171.1        2079
#> 4     7676.4          62     2751.4        1389      206.3        1872
#>   div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1      115.2       24664         50       19711         20     3809758.17
#> 2       91.7       35790         50       25981         20     5966533.86
#> 3      100.5       27674         50       21895         20     8055430.87
#> 4        133       28003         50       20651         20              0
#>   rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1              2     6198036.67    9879655          6000000
#> 2              3     9073426.56   11696905          8000000
#> 3              4    10649716.37   10406895         10000000
#> 4              0     13280236.5   11610950          2000000
#>   guaranteedJackpot drawMachine ballSet    status winners millionairs
#> 1                 0        RNG2     RNG published   47494           0
#> 2                 0        RNG2     RNG published   66033           0
#> 3                 0        RNG2     RNG published   53134           0
#> 4                 0        RNG2     RNG published   52006           1
#>   gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1     47494         0         0         0         0         0         0
#> 2     66033         0         0         0         0         0         0
#> 3     53134         0         0         0         0         0         0
#> 4     52006         0         0         0         0         0         0
#>   kznwinners nwwinners
#> 1          0         0
#> 2          0         0
#> 3          0         0
#> 4          0         0

Created on 2020-04-13 by the reprex package (v0.3.0)
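Since the JSON encodes most values as strings, the columns of the resulting data frame arrive as character (or as factor, on R versions before 4.0.0). A possible follow-up step, sketched here with a hypothetical helper convert_draw_types and a small made-up example table, converts them to more useful types:

```r
convert_draw_types <- function(draws) {
  # as.character() first, so factor columns (the default before R 4.0.0) work too
  draws[] <- lapply(draws, function(col) {
    type.convert(as.character(col), as.is = TRUE)
  })
  # Dates use the "YYYY/MM/DD" layout seen in the output above
  date_cols <- intersect(c("drawDate", "nextDrawDate"), names(draws))
  draws[date_cols] <- lapply(draws[date_cols], as.Date, format = "%Y/%m/%d")
  draws
}

# Small example with a few of the columns from the output above
example <- data.frame(
  drawNumber  = c("2011", "2012"),
  drawDate    = c("2020/04/08", "2020/04/11"),
  totalSales  = c("10406895", "11610950"),
  div1Payout  = c("0", "10546013.8"),
  drawMachine = c("RNG2", "RNG2"),
  stringsAsFactors = FALSE
)
typed <- convert_draw_types(example)
```

Columns that do not look numeric, such as drawMachine, are left as character.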

【Comments】:

  • Awesome, thank you! I arrived at an almost identical solution in the meantime, although yours is more elegant.

【Solution 2】:

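I have already accepted a satisfactory answer to this question (see above). I arrived at an almost identical solution at the same time; I am adding it here only because it explicitly covers all available draw numbers and automatically detects the most recent draw number, so that the code can be run 'as is' in future, provided the National Lottery website design stays the same.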

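# Same endpoint as above; the query string uses plain "&" separators.
theurl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"
# Read the page text and pick out the draw numbers that follow preceding_string.
x <- rvest::html_text(xml2::read_html(theurl))
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "
# substr() takes exactly four characters, so this assumes four-digit draw numbers.
drawnums <- as.integer(vapply(gregexpr(preceding_string, x)[[1]] + nchar(preceding_string),
              function(k) substr(x, start = k, stop = k + 3), NA_character_))
drawnumrange <- 1506:max(drawnums)
# One POST request per draw, sent as a URL-encoded form.
response <- lapply(drawnumrange, function(d) httr::POST(url = theurl,
                body = list(gameName = "LOTTO", drawNumber = as.character(d), isAjax = "true"),
                encode = "form"))
# Extract the body text from each response before parsing the JSON.
jsondat <- lapply(response, function(r)
  jsonlite::parse_json(httr::content(r, "text", encoding = "UTF-8"))$data$drawDetails)
lottotable <- as.data.frame(do.call(rbind, jsondat))
# Convert the numeric columns (selected by position), then write the first 37 columns to Excel.
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)
xlsx::write.xlsx2(lottotable[1:37], "lottotable.xlsx", row.names = FALSE)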
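The draw-number detection above reads a fixed four characters after each match. A variant that does not depend on the number of digits could use a regular expression instead. This is a sketch only: extract_draw_numbers is a hypothetical helper, it assumes the prefix contains no regex metacharacters (true for the string used here), and the sample text is made up to mimic the page content.

```r
extract_draw_numbers <- function(page_text,
                                 prefix = "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW ") {
  # Match the prefix followed by one or more digits
  pattern <- paste0(prefix, "[0-9]+")
  hits <- regmatches(page_text, gregexpr(pattern, page_text))[[1]]
  # Strip the prefix, leaving only the digits
  as.integer(sub(prefix, "", hits, fixed = TRUE))
}

# Made-up sample text in the style of the video titles in the response
sample_text <- paste(
  "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)",
  "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2011 (8 APRIL 2020)"
)
found <- extract_draw_numbers(sample_text)
latest <- max(found)
```

When the prefix does not occur at all, the function returns an empty integer vector, so the caller should check its length before taking max().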

