Answers: Using Pandas to parse a JSON column with nested values in a huge CSV

[Question title]: Using Pandas to parse a JSON column w/nested values in a huge CSV
[Posted]: 2018-12-02 01:27:15
[Question]:

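I have a huge CSV file (3.5 GB, and growing daily) that contains ordinary values plus one column named "Metadata" that holds nested JSON. My script is below; the goal is simply to turn the JSON column into ordinary columns, one per key-value pair. I am using Python 3 (Anaconda; Windows).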

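import json  # needed for the json.loads converter below
import pandas as pd
import numpy as np
import csv
import datetime as dt

# in pandas 1.0 and later this function is available as pd.json_normalize
from pandas.io.json import json_normalize

# with chunksize set, read_csv returns an iterator of DataFrames
for df in pd.read_csv("source.csv", engine='c',
    dayfirst=True,
    encoding='utf-8',
    header=0,
    nrows=10,
    chunksize=2,
    converters={'Metadata': json.loads}):  # parse each JSON string into a dict

    ## parsing code comes here

    # note: appending like this writes the header row once per chunk
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')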

The column holds JSON in the following format:

{  
   "content_id":"xxxx",
   "parental":"F",
   "my_custom_data":{  
      "GroupId":"NA",
      "group":null,
      "userGuid":"xxxxxxxxxxxxxx",
      "deviceGuid":"xxxxxxxxxxxxx",
      "connType":"WIFI",
      "channelName":"VOD",
      "assetId":"xxxxxxxxxxxxx",
      "GroupName":"NA",
      "playType":"VOD",
      "appVersion":"2.1.0",
      "userEnvironmentContext":"",
      "vodEncode":"H.264",
      "language":"English"
   }
}

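The desired output has all of the key-value pairs above as columns. The data frame has other, non-JSON columns, and I need to add the columns parsed from the JSON alongside them. I tried json_normalize, but I am not sure how to apply json_normalize to a Series object and then convert (or explode) the result into multiple columns.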

[Comments on the question]:

Tags: python json python-3.x pandas csv

[Solution 1]:

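Use json_normalize() directly on the series, then use pandas.concat() with axis=1 to join the new data frame to the existing one column-wise (without axis=1 the rows would be stacked instead):

pd.concat([df, json_normalize(df['Metadata'])], axis=1)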

If you no longer need the old column containing the JSON data structure, add .drop('Metadata', axis=1).
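The steps above can be sketched as follows. This is only an illustration under stated assumptions: the two-row frame and its `id` column are made up, the example uses `pd.json_normalize` (the top-level name in pandas 1.0 and later) on a plain list of dicts, and the index is reassigned because a chunk read with `chunksize` does not necessarily start at 0 while the normalized frame always does.

```python
import pandas as pd

# Hypothetical sample standing in for one chunk of the real CSV;
# the index deliberately does not start at 0, as with later chunks.
df = pd.DataFrame(
    {
        "id": [10, 11],
        "Metadata": [
            {"content_id": "a", "parental": "F",
             "my_custom_data": {"connType": "WIFI", "group": None}},
            {"content_id": "b", "parental": "T",
             "my_custom_data": {"connType": "LTE", "group": None}},
        ],
    },
    index=[2, 3],
)

# Flatten the dicts; nested keys become "my_custom_data.<key>" columns.
flat = pd.json_normalize(df["Metadata"].tolist())
# The normalized frame has a fresh 0-based index, so reuse the chunk's
# index to keep rows aligned when concatenating column-wise.
flat.index = df.index

# Drop the raw JSON column and attach the flattened columns.
result = pd.concat([df.drop("Metadata", axis=1), flat], axis=1)
print(result.columns.tolist())
```

Without the index reassignment, `pd.concat(..., axis=1)` would align on index labels and produce rows padded with NaN for any chunk whose index does not start at 0.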

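The columns generated for the my_custom_data nested dictionary are prefixed with my_custom_data. (including the trailing dot). If all the names in that nested dictionary are unique, you can remove the prefix with a DataFrame.rename() operation: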

json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
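As a hedged variation on the rename above, the sketch below derives the slice length from the prefix itself instead of hard-coding 15, and only strips the prefix when the result would contain no duplicate column names. The sample record and the `PREFIX` constant are assumptions made for illustration.

```python
import pandas as pd

PREFIX = "my_custom_data."  # assumed prefix produced by json_normalize

# Hypothetical single-record sample.
flat = pd.json_normalize(
    [{"content_id": "a",
      "my_custom_data": {"connType": "WIFI", "language": "English"}}]
)

# Compute the candidate names with the prefix removed.
stripped = [c[len(PREFIX):] if c.startswith(PREFIX) else c for c in flat.columns]

# Apply them only if stripping cannot create duplicate column names.
if len(set(stripped)) == len(stripped):
    flat.columns = stripped

print(flat.columns.tolist())
```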

If you use some other means to convert each dictionary value to a flat structure (for example, flatten_json), then use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    # wrap the flat dict (not the function itself) in a Series
    return pd.Series(result)

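You can then concatenate the result of the Series.apply() call (which will be a data frame) back onto your original data frame, again column-wise:

pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)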
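A minimal, self-contained sketch of this apply-based route is shown below. The recursive `flatten` helper is a hypothetical stand-in for a flattening library such as flatten_json, and the sample frame is made up; the point is only to show that returning a `pd.Series` from the applied function yields a data frame that shares the original index.

```python
import pandas as pd

def flatten(d, parent="", sep="_"):
    """Hypothetical stand-in for a flattening library: nested dict -> flat dict."""
    out = {}
    for key, value in d.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested dicts
        else:
            out[name] = value
    return out

def to_series(d):
    # Returning a Series makes Series.apply() produce a DataFrame.
    return pd.Series(flatten(d))

# Hypothetical sample data.
df = pd.DataFrame({
    "id": [1, 2],
    "Metadata": [
        {"content_id": "a", "my_custom_data": {"connType": "WIFI"}},
        {"content_id": "b", "my_custom_data": {"connType": "LTE"}},
    ],
})

expanded = df["Metadata"].apply(to_series)  # keeps df's index
result = pd.concat([df.drop(columns="Metadata"), expanded], axis=1)
print(result)
```

Because `apply` preserves the index of `df`, no index realignment is needed here, at the cost of building one Series per row, which is likely to be slower than json_normalize on large chunks.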

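[Discussion]:

Thanks for the reply... The reason I loop is that read_csv returns an iterator when you use chunksize=xxx. Meanwhile, I am also getting a KeyError: 'group' from json_normalize.
@A.Ali: ah, my mistake, I missed the chunksize argument there.
Apparently json_normalize does not handle my JSON well; it gives me a KeyError: 'group'. I had more success with the flatten_json library. Now I have a Python dict and need to create a data frame from it, then attach that to the existing data frame.
@A.Ali: if you have a flat dictionary (all values are scalars), just use df[columnname].apply(pd.Series) to create a Series() from each one directly; this returns a new data frame, which you combine with pd.concat() just as you would the json_normalize() result.
def xflatten(js): return pd.Series(flatten_json.flatten(js)), followed by pd.concat([df.drop(['Metadata'], axis=1), df.Metadata.apply(xflatten)], axis=1)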
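Putting the pieces of this thread together, a chunked pipeline might look like the sketch below. It is an illustration, not the poster's final script: the three-row source, the in-memory `io.StringIO` buffers (standing in for source.csv and output.csv), and the assumption that every chunk yields the same set of columns are all made up for the example.

```python
import io
import json
import pandas as pd

# Hypothetical three-row source standing in for source.csv.
rows = [
    {"id": i,
     "Metadata": json.dumps(
         {"content_id": f"c{i}",
          "my_custom_data": {"connType": "WIFI", "group": None}})}
    for i in range(3)
]
source = io.StringIO(pd.DataFrame(rows).to_csv(index=False))
output = io.StringIO()  # stands in for the real output file

# chunksize makes read_csv yield DataFrames; the converter parses the JSON.
reader = pd.read_csv(source, chunksize=2, converters={"Metadata": json.loads})

for i, chunk in enumerate(reader):
    flat = pd.json_normalize(chunk["Metadata"].tolist())
    flat.index = chunk.index  # chunk indexes continue across chunks
    out = pd.concat([chunk.drop(columns="Metadata"), flat], axis=1)
    # Write the header only once; assumes all chunks share the same columns.
    out.to_csv(output, index=False, header=(i == 0))

print(output.getvalue())
```

If different chunks can produce different key sets, the columns would need to be fixed up front (for example with `DataFrame.reindex(columns=...)`) before appending, otherwise the rows in the output file would not line up with the header.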