【Question title】: R: Web scraping JSON, extracting information from nest
【Posted】: 2017-05-25 17:01:09
【Question】:

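I am trying to extract information from JSON using tidyJSON, but I am open to any R package that gets the job done. I have gone through the documentation and vignettes and found the complex example helpful. However, the information I want is nested under keys that are themselves data (the app IDs) rather than fixed field names, and I don't know how to access it. I am interested in getting appid, name, developer, etc., but that information sits inside 570, 730, and so on: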

{"570":{"appid":570,"name":"Dota 2","developer":"Valve","publisher":"Valve","score_rank":71,"owners":102151578,"owners_variance":259003,"players_forever":102151578,"players_forever_variance":259003,"players_2weeks":9436299,"players_2weeks_variance":89979,"average_forever":11727,"average_2weeks":1229,"median_forever":277,"median_2weeks":662,"ccu":811259,"price":"0","tags":{"Free to Play":22678,"MOBA":7808,"Strategy":7415,"Multiplayer":6757,"Team-Based":4848,"Action":4602,"e-sports":4089,"Online Co-Op":3669,"Competitive":3553,"PvP":2655,"RTS":2267,"Difficult":2129,"RPG":2114,"Fantasy":2044,"Tower Defense":2024,"Co-op":1898,"Character Customization":1514,"Replay Value":1487,"Action RPG":1397,"Simulation":1024}},

"730":{"appid":730,"name":"Counter-Strike: Global Offensive","developer":"Valve","publisher":"Valve","score_rank":78,"owners":29225079,"owners_variance":154335,"players_forever":28552354,"players_forever_variance":152685,"players_2weeks":9102348,"players_2weeks_variance":88410,"average_forever":17648,"average_2weeks":791,"median_forever":5030,"median_2weeks":358,"ccu":543626,"price":"1499","tags":{"FPS":17082,"Multiplayer":13744,"Shooter":12833,"Action":10881,"Team-Based":10369,"Competitive":9664,"Tactical":8529,"First-Person":7329,"e-sports":6716,"PvP":6383,"Online Co-Op":5714,"Military":4621,"Co-op":4435,"Strategy":4424,"War":4361,"Realistic":3196,"Trading":3191,"Difficult":3158,"Fast-Paced":3100,"Moddable":2496}}

There are thousands of entries like this. Is there a way to skip the "top level" and look inside the nest?
The JSON comes from http://steamspy.com/api.php?request=top100in2weeks

【Comments】:

  • You could first try listviewer::jsonedit to help you visualize the data. jsonlite may then help you extract what you need.

Tags: json r web-scraping jsonlite


【Solution 1】:

This may be what you need:

library(jsonlite)

# Parse the API response into a named list keyed by appid
data <- fromJSON("http://steamspy.com/api.php?request=top100in2weeks")

# Pull one field out of every nested entry
appid <- lapply(data, function(x) x$appid)
name <- lapply(data, function(x) x$name)

df <- data.frame(appid = unlist(appid),
                 name = unlist(name),
                 stringsAsFactors = FALSE)

Result:

> head(df)
        appid                             name
570       570                           Dota 2
730       730 Counter-Strike: Global Offensive
578080 578080    PLAYERUNKNOWN'S BATTLEGROUNDS
440       440                  Team Fortress 2
271590 271590               Grand Theft Auto V
433850 433850           H1Z1: King of the Kill

I'll let you add the remaining fields.
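
As a rough sketch of how the remaining scalar fields could be added in one pass (not part of the original answer): the inline JSON below is an abbreviated stand-in for the API response so the example runs without network access, and the `fields` vector and the `games` name are illustrative choices. It assumes every entry contains all of the listed fields.

```r
library(jsonlite)

# Abbreviated sample with the same shape as the API response
# (only a few fields and tags are kept for brevity).
json <- '{
  "570": {"appid": 570, "name": "Dota 2", "developer": "Valve",
          "price": "0", "tags": {"Free to Play": 22678, "MOBA": 7808}},
  "730": {"appid": 730, "name": "Counter-Strike: Global Offensive",
          "developer": "Valve", "price": "1499",
          "tags": {"FPS": 17082, "Multiplayer": 13744}}
}'
data <- fromJSON(json)

# Illustrative selection of scalar fields; extend as needed.
fields <- c("appid", "name", "developer", "price")

# Build a one-row data frame per game, then stack the rows.
rows <- lapply(data, function(x) {
  as.data.frame(x[fields], stringsAsFactors = FALSE)
})
games <- do.call(rbind, rows)
```

The nested `tags` object is left out here on purpose, because it is not a scalar and needs separate handling.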

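Edit: adding arrays to the data frame

You can add the tag information for each game to the data frame, along with the number of times each tag was applied. For each game, store the vector of tag names in one column and the corresponding tag counts in another.

Add the following lines after the definition of df:

for (k in seq_len(nrow(df))) {
    df$tags[k] <- list(names(data[[k]]$tags))
    df$tagsQ[k] <- list(unlist(data[[k]]$tags))
}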

This gives you:

> df["570",]
    appid   name
570   570 Dota 2

tags
570 Free to Play, MOBA, Strategy, Multiplayer, Team-Based, Action, e-sports, Online Co-Op, Competitive, PvP, RTS, Difficult, RPG, Fantasy, Tower Defense, Co-op, Character Customization, Replay Value, Action RPG, Simulation

tagsQ
570 22686, 7810, 7420, 6759, 4850, 4603, 4092, 3672, 3555, 2657, 2267, 2130, 2116, 2045, 2024, 1898, 1514, 1487, 1397, 1023

In this case the tags and tagsQ columns contain lists. To get the second tag of appid 570 and its count, do the following:

> df["570","tags"][[1]][2]
[1] "MOBA"

> df["570","tagsQ"][[1]][2]
MOBA 
7810
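
If list columns are awkward to work with, the tags can instead be flattened into ordinary columns. The following is a minimal sketch, not part of the original answer: it uses an abbreviated inline sample in place of the API response, and the column names `tags_csv` and `tag_*` are illustrative choices.

```r
library(jsonlite)

# Abbreviated sample with the same shape as the API response.
json <- '{
  "570": {"appid": 570, "name": "Dota 2",
          "tags": {"MOBA": 7808, "Multiplayer": 6757}},
  "730": {"appid": 730, "name": "Counter-Strike: Global Offensive",
          "tags": {"FPS": 17082, "Multiplayer": 13744}}
}'
data <- fromJSON(json)

df <- data.frame(
  appid = vapply(data, function(x) as.numeric(x$appid), numeric(1)),
  name  = vapply(data, function(x) x$name, character(1)),
  stringsAsFactors = FALSE
)

# One comma-separated string of tag names per game.
df$tags_csv <- vapply(data,
                      function(x) paste(names(x$tags), collapse = ", "),
                      character(1))

# One logical column per distinct tag: TRUE when the game has that tag.
all_tags <- unique(unlist(lapply(data, function(x) names(x$tags))))
for (tag in all_tags) {
  df[[paste0("tag_", make.names(tag))]] <-
    vapply(data, function(x) tag %in% names(x$tags), logical(1))
}
```

`make.names()` is applied because tag names such as "Free to Play" contain spaces, which are not valid in syntactic column names.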

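【Comments】:

  • Thanks. I am still struggling to turn the "tags" field into a data structure that fits in a data frame. I end up with a named list that cannot be inserted into the data frame. Is there a simple way to convert the tags into dummy boolean columns in the data frame, or to concatenate them into a comma-separated value in a single field? I'm really not good with list structures.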