【Question title】: How to extract desired sections from a JSON string
【Posted】: 2021-01-28 06:24:50
【Question description】:

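I'd like to know how to clean up my data so I can understand it better and filter it more easily. So far I've been able to download a public Google spreadsheet and convert it to a CSV file. But when I print the data, it is very messy and hard to read. The data comes from a website, and when I open the browser's developer tools I can see how neatly it is organized.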

Like this: [image: website data as shown in the browser's inspect view]

But when I print it in a Jupyter notebook, it actually looks like this mess:

b'/*O_o*/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights 2019 (Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"% vs 2019 (Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights (7-day moving average)","type":"number","pattern":"General"},{"id":"H","label":"% vs 2019 (7-day moving average)","type":"number","pattern":"General"},{"id":"I","label":"Day 2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day Previous Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights Previous Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},

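Is there a pandas way to tidy up this data?

Basically, what I want to do is extract three variables from the data: country, date, and a number.

Here you can see how the data starts out under the "rows" key:

[image: Jupyter output showing how the data starts out]

Basically it gives a country, a date, and then a bunch of related numbers.

What I want to get is the country name, the specific date, and one specific number.

For example, here is a sample section; this sequence repeats throughout the data: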

{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},

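From this section I only want to pull out the country name (Albania), the date "2020-09-01", and the number -0.5038.

This is the code I used to fetch the Google spreadsheet data and save it as CSV: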

import requests
import pandas as pd 

r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')

data = r.content

print(data)

Any and all advice would be great.

Thanks

【Question discussion】:

    Tags: python json pandas


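    【Solution 1】:

    I'm not sure how you obtained this csv file, but the simplest approach is to fetch the JSON directly with requests, load it as a dict, and process it. That said, a solution for the current file is: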

    import json

    import pandas as pd
    import requests

    r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')

    # The body is a JavaScript call: google.visualization.Query.setResponse({...});
    # Keep only the text between the first "(" and the last ")" so that plain JSON is left.
    data = r.content
    data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0])

    # Per row: cell 0 = country ("v"), cell 2 = formatted date ("f"), cell 5 = % vs 2019 ("v").
    d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
    df = pd.DataFrame(d, columns=['country', 'date', 'number'])
    
    Output:
    |    | country   | date       |        number |
    |---:|:----------|:-----------|--------------:|
    |  0 | Albania   | 2020-09-01 |     -0.503876 |
    |  1 | Albania   | 2020-09-02 |     -0.358696 |
    |  2 | Albania   | 2020-09-03 |     -0.302083 |
    |  3 | Albania   | 2020-09-04 |     -0.135922 |
    |  4 | Albania   | 2020-09-05 |     -0.43617  |
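
The same unwrapping step can be tried offline. The following is only a sketch, not part of the original answer: `raw` is a shortened copy of the payload from the question (three of the eleven columns, two rows), and `unwrap_gviz` and `rows_to_frame` are hypothetical helper names. It assumes the real response has the same shape. Dates are taken from "f" because, in the question's data, the "v" value "Date(2020,8,1)" pairs with "2020-09-01", so the month inside "v" is zero-based.

```python
import json

import pandas as pd

# Shortened sample of the response shown in the question (assumption:
# the live payload has the same structure, with more columns and rows).
raw = (
    b'/*O_o*/\ngoogle.visualization.Query.setResponse('
    b'{"version":"0.6","status":"ok","table":{'
    b'"cols":[{"id":"A","label":"Entity","type":"string"},'
    b'{"id":"C","label":"Day","type":"date"},'
    b'{"id":"F","label":"% vs 2019 (Daily)","type":"number"}],'
    b'"rows":['
    b'{"c":[{"v":"Albania"},{"v":"Date(2020,8,1)","f":"2020-09-01"},'
    b'{"v":-0.503875968992248,"f":"-0,503875969"}]},'
    b'{"c":[{"v":"Albania"},{"v":"Date(2020,8,2)","f":"2020-09-02"},'
    b'{"v":-0.358695652173913,"f":"-0,3586956522"}]}'
    b']}});'
)


def unwrap_gviz(payload: bytes) -> dict:
    """Strip the JavaScript call around the JSON and parse what is left."""
    text = payload.decode("utf-8")
    start = text.index("(") + 1  # first "(" opens setResponse(
    end = text.rindex(")")       # last ")" closes it
    return json.loads(text[start:end])


def rows_to_frame(table: dict) -> pd.DataFrame:
    """Build a DataFrame named by column labels; use "f" for dates, "v" otherwise."""
    labels = [col["label"] for col in table["cols"]]
    types = [col["type"] for col in table["cols"]]
    records = []
    for row in table["rows"]:
        record = []
        for cell, col_type in zip(row["c"], types):
            if cell is None:            # tolerate empty cells
                record.append(None)
            elif col_type == "date":
                record.append(cell.get("f"))
            else:
                record.append(cell.get("v"))
        records.append(record)
    return pd.DataFrame(records, columns=labels)


sample = unwrap_gviz(raw)
frame = rows_to_frame(sample["table"])
print(frame)
```

Reading the labels from "cols" avoids hard-coding positions such as 0, 2 and 5, at the cost of keeping every column; selecting a subset afterwards is then an ordinary `frame[[...]]` lookup.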
    

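    【Discussion】:

    • You can also slice data directly: data = json.loads(data[47:-2])
    • @RJ Adriaansen, thank you! Is there a way to pick out one specific country by name and then grab its specific dates and numbers? I need to extract particular countries and their associated data points.
    • @RJ Adriaansen, this is also the website I'm pulling the data from: eurocontrol.int/Economics/DailyTrafficVariation-States.html. I inspected the page, went to XHR, and looked at where the GET request comes from. I don't know how to get it as JSON, and I'd love to learn how.
    • No, now I see that the website itself loads the data from the Google spreadsheet in this format, so you can stick with my code. Filtering by country is easy in pandas: df[df['country'] == 'France']
    • @RJ Adriaansen, sorry about my question; I see now that pandas only prints the first 5 rows. This is an amazing answer, thank you so much.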
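
The filtering and printing points raised in the comments can be sketched as follows. This is an illustration only: `df` is a small stand-in for the DataFrame built in the answer, and the "Exampleland" rows are invented placeholders so that there is a second country to filter out.

```python
import pandas as pd

# Stand-in for the answer's DataFrame. The Albania rows come from the output
# table above; the "Exampleland" rows are made up purely for illustration.
df = pd.DataFrame(
    {
        "country": ["Albania", "Albania", "Exampleland", "Exampleland"],
        "date": ["2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"],
        "number": [-0.503876, -0.358696, -0.1, -0.2],
    }
)

# Boolean mask: keep only the rows for one country.
albania = df[df["country"] == "Albania"]

# Several countries at once with isin().
subset = df[df["country"].isin(["Albania", "Exampleland"])]

# Convert the date strings to real datetimes so they can be compared and sorted.
albania = albania.assign(date=pd.to_datetime(albania["date"]))
first_day = albania[albania["date"] == pd.Timestamp("2020-09-01")]

# to_string() prints every row rather than a shortened preview.
print(albania.to_string(index=False))
```

Each filter returns a new DataFrame and leaves `df` unchanged, so the same frame can be sliced for as many countries as needed.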