【Question title】: Convert a nested JSON string to Pandas dataframes (and add "foreign key" to relate them)
【Posted】: 2020-11-05 10:57:50
【Question description】:

I have a CSV file with the following columns:

  • customer_id: the customer's identifier.
  • report_date: the date the report was created.
  • json_report: a JSON object

The JSON object looks like this:

{
    "Person": {
         "Name": {
              "FirstName": "John",
              "LastName": "Doe"
          },
          "Accounts": {
              "Account": [
                   {"AccountNumber":123, "AccountStatus": "G"},
                   {"AccountNumber":137, "AccountStatus": "B"},
                   {"AccountNumber":593, "AccountStatus": "VB"}
              ]
          },
          "Alerts": {
              "Alert": [
                  {"DT":"20200601", "Msg": "Lorem ipsum"},
                  {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}
              ]
          }
    }
}

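As you can see, the object contains nested JSON objects and lists. Other rows in the original CSV file may also have more elements in their JSON object.

What I need is a set of Pandas dataframes that can be related to one another. Following the example above, I need these dataframes:

  • Name, with columns:
    • customer_id
    • report_date
    • FirstName
    • LastName
  • Accounts, with columns:
    • customer_id
    • report_date
    • AccountNumber
    • AccountStatus
  • Alerts, with columns:
    • customer_id
    • report_date
    • DT
    • Msg
    • Msg2

So far I have handled this by hand, identifying the nested objects in the JSON object and processing each one accordingly; however, I know this will become unsustainable at some point.

So, my question is: is there a way to do this automatically?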


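What I have been doing so far:

  1. I read the CSV file as a Pandas dataframe.
  2. I iterate over each row, reading customer_id, report_date and json_report.
  3. I convert the JSON report to a dictionary.
  4. I pull out the relevant nested objects.
    • If the nested object is a dictionary (e.g. Name), I add the customer_id and report_date key-value pairs and append the edited dictionary to a list (e.g. lst_names).
    • If the nested object is a list (e.g. Accounts/Account), I add the customer_id and report_date key-value pairs to each nested dictionary, then append each dictionary to a list (e.g. lst_accounts).
  5. I convert each list to a Pandas dataframe.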
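
The manual steps above might be sketched roughly as follows. The inline CSV text, the ';' separator, and the quoting=csv.QUOTE_NONE setting are assumptions made only so the snippet runs on its own; the real file may be laid out differently.

```python
import csv
import io
import json

import pandas as pd

# Hypothetical one-row stand-in for the real CSV file (assumed ';' separated).
csv_text = (
    "customer_id;report_date;json_report\n"
    '1;20200601;{"Person": {"Name": {"FirstName": "John", "LastName": "Doe"}, '
    '"Accounts": {"Account": [{"AccountNumber": 123, "AccountStatus": "G"}, '
    '{"AccountNumber": 137, "AccountStatus": "B"}]}}}\n'
)

# Step 1: read the CSV; QUOTE_NONE leaves the double quotes of the JSON untouched.
df = pd.read_csv(io.StringIO(csv_text), sep=";", quoting=csv.QUOTE_NONE, dtype=str)

lst_names, lst_accounts = [], []
for row in df.itertuples(index=False):  # step 2: iterate over the rows
    keys = {"customer_id": row.customer_id, "report_date": row.report_date}
    person = json.loads(row.json_report)["Person"]  # step 3: JSON str -> dict
    # Step 4a: a nested dict gets the key columns and becomes one row.
    lst_names.append({**keys, **person["Name"]})
    # Step 4b: every dict in a nested list gets the key columns as well.
    for account in person["Accounts"]["Account"]:
        lst_accounts.append({**keys, **account})

# Step 5: one dataframe per list.
names = pd.DataFrame(lst_names)
accounts = pd.DataFrame(lst_accounts)
```

Every nested object has to be named explicitly here (Name, Accounts/Account), which is the part that does not scale as new elements appear.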

Having several dataframes matters because I need each one for a different task (i.e., I don't want to use json_normalize, if possible).

【Comments】:

    Tags: python json pandas dataframe json-normalize


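    【Answer 1】:
    • The simplest approach is to read the csv and update the JSON data with the extra information.
    • Much of the unnecessary nesting can be removed, which makes the JSON easier to work with.

    Update the JSON data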

    import csv
    import json
    import pandas as pd
    
    # read in the csv file
    with open('test.csv', 'r', newline='') as f:
        data = list(csv.reader(f, delimiter=';'))
    
    # flatten each report into a dict that also carries the id and the report date
    new_json = list()
    for idx, date, raw in data[1:]:  # data[0] is the header row
        # json.loads also handles true/false/null, which ast.literal_eval rejects
        report = json.loads(raw)  # convert the str to a dict
        person = report.pop('Person')  # take the nested object out of the dict
        report['id'] = idx  # add customer id (it can repeat across reports)
        report['date'] = date  # add report date
        report['Accounts'] = person['Accounts']['Account']  # move list to top level key
        report['Alerts'] = person['Alerts']['Alert']  # move list to top level key
        report['first_name'] = person['Name']['FirstName']  # move value to top level key
        report['last_name'] = person['Name']['LastName']  # move value to top level key
        new_json.append(report)  # append to list
    
    # print(new_json[0])
    {'Accounts': [{'AccountNumber': 123, 'AccountStatus': 'G'},
                  {'AccountNumber': 137, 'AccountStatus': 'B'},
                  {'AccountNumber': 593, 'AccountStatus': 'VB'}],
     'Alerts': [{'DT': '20200601', 'Msg': 'Lorem ipsum'},
                {'DT': '20200615', 'Msg': 'Dolor sit amet', 'Msg2': 'Lorem'}],
     'date': '20200601',
     'first_name': 'John1',
     'id': '123',
     'last_name': 'Doe1'}
    

    Create the separate dataframes

    # create accounts
    accounts = pd.json_normalize(new_json, ['Accounts'], ['id', 'date'])
    
    # display(accounts.head())
       AccountNumber AccountStatus   id      date
    0            123             G  123  20200601
    1            137             B  123  20200601
    2            593            VB  123  20200601
    3            123             G  456  20200602
    4            137             B  456  20200602
    
    # create alerts
    alerts = pd.json_normalize(new_json, ['Alerts'], ['id', 'date'])
    
    # display(alerts.head())
             DT             Msg   Msg2   id      date
    0  20200601     Lorem ipsum    NaN  123  20200601
    1  20200615  Dolor sit amet  Lorem  123  20200601
    2  20200601     Lorem ipsum    NaN  456  20200602
    3  20200615  Dolor sit amet  Lorem  456  20200602
    4  20200601     Lorem ipsum    NaN  789  20200603
    
    # create name
    name = pd.json_normalize(new_json).drop(columns=['Accounts', 'Alerts'])
    
    # display(name)
        id      date first_name last_name
    0  123  20200601      John1      Doe1
    1  456  20200602      John2      Doe2
    2  789  20200603      John3      Doe3
    3  123  20200606      John1      Doe1
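
Since every frame carries id and date, the two columns can act as the "foreign key" from the question. A minimal sketch of joining two of the frames back together, using small hand-built frames that mirror the outputs above (the values are illustrative, not read from test.csv):

```python
import pandas as pd

# Hand-built stand-ins for the name and accounts frames shown above.
names = pd.DataFrame({
    "id": ["123", "456"],
    "date": ["20200601", "20200602"],
    "first_name": ["John1", "John2"],
    "last_name": ["Doe1", "Doe2"],
})
accounts = pd.DataFrame({
    "AccountNumber": [123, 137, 123],
    "AccountStatus": ["G", "B", "G"],
    "id": ["123", "123", "456"],
    "date": ["20200601", "20200601", "20200602"],
})

# id alone can repeat (one customer, several reports), so join on both columns.
# validate="many_to_one" raises if (id, date) is not unique in names.
merged = accounts.merge(names, on=["id", "date"], how="left", validate="many_to_one")
```

Joining on id alone would duplicate rows for a customer with more than one report, as with id 123 in test.csv.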
    

    Data used in test.csv:

    id;date;json
    123;20200601;{"Person": {"Name": {"FirstName": "John1", "LastName": "Doe1"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
    456;20200602;{"Person": {"Name": {"FirstName": "John2", "LastName": "Doe2"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
    789;20200603;{"Person": {"Name": {"FirstName": "John3", "LastName": "Doe3"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
    123;20200606;{"Person": {"Name": {"FirstName": "John1", "LastName": "Doe1"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
    

    As a function

    from typing import List, Tuple  # used for type hints
    import csv
    import json
    import pandas as pd
    
    
    def fix_json(data: List[List[str]]) -> List[dict]:
        """Flatten each report and attach the id and date from its csv row."""
        new_json = list()
        for idx, date, raw in data[1:]:  # data[0] is the header row
            report = json.loads(raw)  # also handles true/false/null
            person = report.pop('Person')
            report['id'] = idx
            report['date'] = date
            report['Accounts'] = person['Accounts']['Account']
            report['Alerts'] = person['Alerts']['Alert']
            report['first_name'] = person['Name']['FirstName']
            report['last_name'] = person['Name']['LastName']
            new_json.append(report)
    
        return new_json
    
    
    def make_dataframes(file_path_name: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """Read the csv and return the accounts, alerts and names dataframes."""
        with open(file_path_name, 'r', newline='') as f:
            data = list(csv.reader(f, delimiter=';'))
    
        new_json = fix_json(data)
    
        accounts = pd.json_normalize(new_json, ['Accounts'], ['id', 'date'])
        alerts = pd.json_normalize(new_json, ['Alerts'], ['id', 'date'])
        names = pd.json_normalize(new_json).drop(columns=['Accounts', 'Alerts'])
    
        return accounts, alerts, names
    
    
    # function call
    accounts, alerts, names = make_dataframes('test.csv')
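
The functions above still name each nested key by hand. If the reports gain new elements, one possible direction is a recursive walk that creates a table per nested key without listing them. This is only a sketch under assumptions: collect_tables and build_frames are hypothetical helpers, tables are named after the nearest enclosing key (so the result has Account and Alert rather than Accounts and Alerts), lists of plain scalars are ignored, and identical key names at different paths would land in the same table.

```python
from collections import defaultdict

import pandas as pd


def collect_tables(node, keys, name, tables):
    """Recursively split a parsed JSON node into rows grouped by table name."""
    if isinstance(node, list):
        for item in node:  # each list element is handled under the same table name
            collect_tables(item, keys, name, tables)
    elif isinstance(node, dict):
        scalars = {k: v for k, v in node.items() if not isinstance(v, (dict, list))}
        if scalars:  # a dict with plain values becomes one row, plus the key columns
            tables[name].append({**keys, **scalars})
        for k, v in node.items():
            if isinstance(v, (dict, list)):  # descend, naming the table after the key
                collect_tables(v, keys, k, tables)


def build_frames(records):
    """records: iterable of (key_columns_dict, parsed_report_dict) pairs."""
    tables = defaultdict(list)
    for keys, report in records:
        collect_tables(report, keys, "root", tables)
    return {name: pd.DataFrame(rows) for name, rows in tables.items()}


report = {
    "Person": {
        "Name": {"FirstName": "John", "LastName": "Doe"},
        "Accounts": {"Account": [{"AccountNumber": 123, "AccountStatus": "G"},
                                 {"AccountNumber": 137, "AccountStatus": "B"}]},
        "Alerts": {"Alert": [{"DT": "20200601", "Msg": "Lorem ipsum"},
                             {"DT": "20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]},
    }
}
frames = build_frames([({"id": "123", "date": "20200601"}, report)])
```

Here frames is a dict of dataframes keyed by table name, and each frame starts with the id and date columns so the frames can still be related to one another.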
    

    【Comments】:
