【Question title】: Parsing unstructured JSON into CSV
【Posted】: 2018-04-07 16:39:45
【Question】:

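I have yearly application data for different applications in JSON format. There are 10 different JSON files per application, and I am trying to merge them into a single CSV. First, here is the data structure: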

[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]

When I parse them into a pandas DataFrame, I get something like this:

date         downloads  end         data
2017-10-23   15358985   2017-10-23  {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22   12778233   2017-10-22  {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}

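Note that not every version is downloaded every day. How can I create a column for each version of the application? If a version was not downloaded on a particular day, the cell can be left blank or filled with NaN.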

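【Question comments】:

  • Have you tried pd.io.json.json_normalize(your_dict)? I think this is a duplicate.
  • Can you share the link?
  • In your case the function call alone should be enough, nothing more.
  • I reopened the question because the OP needs to add rows of NaNs for the missing days.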

Tags: python json pandas csv


【Solution 1】:

I think you need the DataFrame constructor plus reindex to add the missing rows:

import pandas as pd

j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]

# Use the dates as a DatetimeIndex
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Reindex over the full daily range so that missing days become NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)
                                                         data   downloads  \
2017-10-22  {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2...  12778233.0   
2017-10-23                                                NaN         NaN   
2017-10-24                                                NaN         NaN   
2017-10-25  {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42...  15358985.0   

                   end  
2017-10-22  2017-10-22  
2017-10-23         NaN  
2017-10-24         NaN  
2017-10-25  2017-10-23  
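
The output above still keeps each day's versions inside a single dict column. One possible way to spread that column into one column per version is sketched below. It is only an illustration: the records are a trimmed subset of the sample data, and the names records, versions and wide are made up for this example.

```python
import pandas as pd

# Trimmed subset of the sample data, for illustration only
records = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.6.10": 11538}},
]

df = pd.DataFrame(records).set_index("date")
df.index = pd.to_datetime(df.index)

# Build one column per version from the per-day dicts;
# versions absent on a given day come out as NaN
versions = pd.DataFrame(df["data"].tolist(), index=df.index)

# Replace the dict column with the per-version columns
wide = df.drop(columns="data").join(versions)

# Add NaN rows for days that have no record at all
wide = wide.reindex(pd.date_range(wide.index.min(), wide.index.max()))
print(wide)
```

This has to run before any reindex that introduces NaN into the data column, because the dict expansion expects every row to hold a dict.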

A solution with json_normalize, though if the JSON objects have different keys you get a lot of NaN values:

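# pd.json_normalize needs pandas 1.0+; on older versions use
# from pandas.io.json import json_normalize
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Reindex over the full daily range so that missing days become NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)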
            data.1.0.1  data.1.0.2  data.2.2.3.1-signed  data.2.3.1.1-signed  \
2017-10-22         NaN         NaN                  NaN                  3.0   
2017-10-23         NaN         NaN                  NaN                  NaN   
2017-10-24         NaN         NaN                  NaN                  NaN   
2017-10-25       268.0       715.0               9292.0                  NaN   

            data.2.4.1  data.2.6.10  data.2.6.4.1-signed  \
2017-10-22       842.0      11538.0                  8.0   
2017-10-23         NaN          NaN                  NaN   
2017-10-24         NaN          NaN                  NaN   
2017-10-25         NaN          NaN                  NaN   

            data.2.7.2.4151-beta  data.2.7.3.4196-beta  data.2.7.3.4198-beta  \
2017-10-22                   NaN                   5.0                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   1.0                   7.0                   NaN   

            data.2.7.3.4215-beta  data.2.9.0.4250-beta  data.2.99.0.1857beta  \
2017-10-22                   NaN                   NaN                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   2.0                   1.0                   NaN   

            data.2.99.0.1872beta   downloads         end  
2017-10-22                  12.0  12778233.0  2017-10-22  
2017-10-23                   NaN         NaN         NaN  
2017-10-24                   NaN         NaN         NaN  
2017-10-25                   NaN  15358985.0  2017-10-23  
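
The question also mentions merging several JSON files per application into one CSV. A rough sketch of that step, assuming every file holds a list of records shaped like the sample, could look like the following. The function name json_files_to_frame is hypothetical and not part of pandas.

```python
import json

import pandas as pd


def json_files_to_frame(paths):
    """Combine JSON files shaped like the sample into one wide DataFrame.

    Hypothetical helper: assumes each file contains a list of records
    with "date", "downloads", "end" and a nested "data" dict.
    """
    frames = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            records = json.load(fh)
        # Flatten the nested dict into columns named "data.<version>"
        frames.append(pd.json_normalize(records))
    # Take the union of all columns; versions missing from a file become NaN
    combined = pd.concat(frames, ignore_index=True, sort=False)
    combined["date"] = pd.to_datetime(combined["date"])
    return combined.set_index("date").sort_index()
```

With a list of file paths in hand, something like json_files_to_frame(paths).to_csv("app.csv", na_rep="NaN") would then write the result, where "app.csv" is a placeholder name. Leaving out na_rep writes empty cells instead of NaN.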

【Comments】:

  • Nice answer, I did not see the requirement about the missing days.