Convert a nested JSON string to Pandas dataframes (and add "foreign key" to relate them): answers

【Question title】: Convert a nested JSON string to Pandas dataframes (and add "foreign key" to relate them)
【Posted】: 2020-11-05 10:57:50
【Question description】:

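I have a CSV file with the following columns:

customer_id: the customer ID, exactly what the name says.
report_date: the date the report was created.
json_report: a JSON object.

The JSON object looks like this: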

{
    "Person": {
         "Name": {
              "FirstName": "John",
              "LastName": "Doe"
          },
          "Accounts": {
              "Account": [
                   {"AccountNumber":123, "AccountStatus": "G"},
                   {"AccountNumber":137, "AccountStatus": "B"},
                   {"AccountNumber":593, "AccountStatus": "VB"}
              ]
          },
          "Alerts": {
              "Alert": [
                  {"DT":"20200601", "Msg": "Lorem ipsum"},
                  {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}
              ]
          }
    }
}

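As you can see, the object contains nested JSON objects and lists. In addition, other rows of the original CSV file may contain more elements in their JSON object.

What I need is to create Pandas dataframes that can be related to one another. Following the example above, I need these dataframes: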

Name, with columns:
- customer_id
- report_date
- FirstName
- LastName
Accounts, with columns:
- customer_id
- report_date
- AccountNumber
- AccountStatus
Alerts, with columns:
- customer_id
- report_date
- DT
- Msg
- Msg2

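So far I have handled this by hand, identifying the nested objects in the JSON object and processing each one accordingly; however, I know this will become unsustainable at some point.

So my question is: is there a way to automate this task?

What I have been doing so far:

I read the CSV file as a Pandas dataframe.
I iterate over each row, reading customer_id, report_date and json_report.
I convert the JSON report to a dictionary.
I get the relevant nested objects.
- If the nested object is a dictionary (e.g. Name), I add the customer_id and report_date key-value pairs and append the edited dictionary to a list (e.g. lst_names).
- If the nested object is a list (e.g. Accounts/Account), I add the customer_id and report_date key-value pairs to each nested dictionary, then append each dictionary to a list (e.g. lst_accounts).
I convert each list to a Pandas dataframe.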
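
The manual steps above can be sketched roughly as follows. This is only an illustration: the in-memory df stands in for the real CSV file, its values are made up, and lst_names and lst_accounts are the list names mentioned in the steps.

```python
import json

import pandas as pd

# Made-up report used for every row; the real data comes from the CSV file.
report = {
    "Person": {
        "Name": {"FirstName": "John", "LastName": "Doe"},
        "Accounts": {
            "Account": [
                {"AccountNumber": 123, "AccountStatus": "G"},
                {"AccountNumber": 137, "AccountStatus": "B"},
                {"AccountNumber": 593, "AccountStatus": "VB"},
            ]
        },
    }
}

# Stand-in for the dataframe obtained by reading the CSV (assumption).
df = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "report_date": ["2020-06-01", "2020-06-02"],
        "json_report": [json.dumps(report), json.dumps(report)],
    }
)

lst_names, lst_accounts = [], []
for row in df.itertuples(index=False):
    person = json.loads(row.json_report)["Person"]  # JSON string -> dict
    keys = {"customer_id": row.customer_id, "report_date": row.report_date}
    # Nested dictionary: one record per report.
    lst_names.append({**keys, **person["Name"]})
    # Nested list: one record per element.
    for account in person["Accounts"]["Account"]:
        lst_accounts.append({**keys, **account})

names = pd.DataFrame(lst_names)
accounts = pd.DataFrame(lst_accounts)
```

Every nested object needs its own hand-written branch here, which is the part that stops scaling as new elements appear.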

Having several dataframes matters because I need each one for a different task (i.e., I would rather not use json_normalize, if possible).

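【Question comments】:

Tags: python json pandas dataframe json-normalize

【Solution 1】:

The simplest approach is to read the csv and update the JSON data with the extra information.
Much of the unnecessary nesting can be removed, which makes the JSON easier to work with.

Update the JSON data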

import csv
import json

import pandas as pd

# read in the csv file; row 0 is the header
with open('test.csv', 'r', newline='') as f:
    data = list(csv.reader(f, delimiter=';'))

# flatten each report and build a list of records that now carry all the information
new_json = list()
for idx, date, raw in data[1:]:  # skip the header row
    # json.loads is used instead of ast.literal_eval, which fails on JSON true/false/null
    report = json.loads(raw)  # convert the str to a dict
    person = report.pop('Person')  # remove the wrapper; its contents are moved up below
    report['id'] = idx  # add customer id
    report['date'] = date  # add report date
    report['Accounts'] = person['Accounts']['Account']  # move list to top level key
    report['Alerts'] = person['Alerts']['Alert']  # move list to top level key
    report['first_name'] = person['Name']['FirstName']  # move value to top level key
    report['last_name'] = person['Name']['LastName']  # move value to top level key
    new_json.append(report)  # append to list

# print(new_json[0])
{'Accounts': [{'AccountNumber': 123, 'AccountStatus': 'G'},
              {'AccountNumber': 137, 'AccountStatus': 'B'},
              {'AccountNumber': 593, 'AccountStatus': 'VB'}],
 'Alerts': [{'DT': '20200601', 'Msg': 'Lorem ipsum'},
            {'DT': '20200615', 'Msg': 'Dolor sit amet', 'Msg2': 'Lorem'}],
 'date': '20200601',
 'first_name': 'John1',
 'id': '123',
 'last_name': 'Doe1'}

Create the separate dataframes

# create accounts
accounts = pd.json_normalize(new_json, ['Accounts'], ['id', 'date'])

# display(accounts.head())
   AccountNumber AccountStatus   id      date
0            123             G  123  20200601
1            137             B  123  20200601
2            593            VB  123  20200601
3            123             G  456  20200602
4            137             B  456  20200602

# create alerts
alerts = pd.json_normalize(new_json, ['Alerts'], ['id', 'date'])

# display(alerts.head())
         DT             Msg   Msg2   id      date
0  20200601     Lorem ipsum    NaN  123  20200601
1  20200615  Dolor sit amet  Lorem  123  20200601
2  20200601     Lorem ipsum    NaN  456  20200602
3  20200615  Dolor sit amet  Lorem  456  20200602
4  20200601     Lorem ipsum    NaN  789  20200603

# create name
name = pd.json_normalize(new_json).drop(columns=['Accounts', 'Alerts'])

# display(name)
    id      date first_name last_name
0  123  20200601      John1      Doe1
1  456  20200602      John2      Doe2
2  789  20200603      John3      Doe3
3  123  20200606      John1      Doe1
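
The three dataframes share the id and date columns, and the pair acts as the "foreign key" (id alone is not unique, since customer 123 has two reports). A minimal sketch of relating them with merge, using small hand-built frames that mirror the output above rather than reading test.csv:

```python
import pandas as pd

# Hand-built stand-ins that mirror the names and accounts output (illustrative only).
names = pd.DataFrame(
    {
        "id": ["123", "456"],
        "date": ["20200601", "20200602"],
        "first_name": ["John1", "John2"],
        "last_name": ["Doe1", "Doe2"],
    }
)
accounts = pd.DataFrame(
    {
        "AccountNumber": [123, 137, 123],
        "AccountStatus": ["G", "B", "G"],
        "id": ["123", "123", "456"],
        "date": ["20200601", "20200601", "20200602"],
    }
)

# Join each account row to its report through the shared (id, date) key.
# validate raises MergeError if a key appears more than once in names.
accounts_with_names = accounts.merge(
    names, on=["id", "date"], how="left", validate="many_to_one"
)
```

A left merge keeps every account row, so accounts without a matching name record would show NaN instead of being dropped.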

Data used in `test.csv`:

id;date;json
123;20200601;{"Person": {"Name": {"FirstName": "John1", "LastName": "Doe1"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
456;20200602;{"Person": {"Name": {"FirstName": "John2", "LastName": "Doe2"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
789;20200603;{"Person": {"Name": {"FirstName": "John3", "LastName": "Doe3"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
123;20200606;{"Person": {"Name": {"FirstName": "John1", "LastName": "Doe1"}, "Accounts": {"Account": [{"AccountNumber":123, "AccountStatus": "G"}, {"AccountNumber":137, "AccountStatus": "B"}, {"AccountNumber":593, "AccountStatus": "VB"}]}, "Alerts": {"Alert": [{"DT":"20200601", "Msg": "Lorem ipsum"}, {"DT":"20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"}]}}}
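
Since the question reads the file with Pandas rather than the csv module, the same layout can probably be loaded with read_csv as well. A sketch under the assumption that the JSON never contains a ';' character; the one-row sample string is a shortened, illustrative version of the data above:

```python
import csv
import io
import json

import pandas as pd

# Shortened one-row sample in the same ';'-delimited layout (illustrative only).
sample = (
    "id;date;json\n"
    '123;20200601;{"Person": {"Name": {"FirstName": "John1", "LastName": "Doe1"}}}\n'
)

# QUOTE_NONE keeps the double quotes inside the JSON as literal characters;
# dtype=str keeps id and date as strings instead of integers.
df = pd.read_csv(io.StringIO(sample), sep=";", dtype=str, quoting=csv.QUOTE_NONE)

# Parse the JSON text of each row into a dict.
df["json"] = df["json"].apply(json.loads)
```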

As a function

from typing import List, Tuple  # used for type hints
import csv
import json

import pandas as pd


def fix_json(data: List[List[str]]) -> List[dict]:
    """Flatten each report row into one record; data[0] is the header row."""
    new_json = list()
    for idx, date, raw in data[1:]:  # skip the header row
        report = json.loads(raw)  # convert the str to a dict
        person = report.pop('Person')  # remove the wrapper level
        report['id'] = idx
        report['date'] = date
        report['Accounts'] = person['Accounts']['Account']
        report['Alerts'] = person['Alerts']['Alert']
        report['first_name'] = person['Name']['FirstName']
        report['last_name'] = person['Name']['LastName']
        new_json.append(report)

    return new_json


def make_dataframes(file_path_name: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Read the ';'-delimited csv and return the accounts, alerts and names dataframes."""
    with open(file_path_name, 'r', newline='') as f:
        data = list(csv.reader(f, delimiter=';'))

    new_json = fix_json(data)

    accounts = pd.json_normalize(new_json, ['Accounts'], ['id', 'date'])
    alerts = pd.json_normalize(new_json, ['Alerts'], ['id', 'date'])
    names = pd.json_normalize(new_json).drop(columns=['Accounts', 'Alerts'])

    return accounts, alerts, names


# function call
accounts, alerts, names = make_dataframes('test.csv')
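
The function above hard-codes the key paths (Person, Accounts/Account, Alerts/Alert). If other rows carry extra nested elements, as the question anticipates, one possible generalization is to walk each report recursively and route every nested dictionary into a table named after its enclosing key. This is only a sketch: split_tables is a hypothetical helper, not a pandas function, and it assumes lists hold dictionaries (lists of plain values are skipped) and that a key name maps to one table regardless of depth.

```python
from collections import defaultdict
from typing import Any, Dict, List

import pandas as pd


def split_tables(node: Any, keys: Dict[str, Any],
                 tables: Dict[str, List[dict]], name: str = "root") -> None:
    """Collect flat records from `node` into `tables`, keyed by the enclosing key name."""
    if isinstance(node, list):
        for item in node:  # each list element becomes its own record
            split_tables(item, keys, tables, name)
    elif isinstance(node, dict):
        # Plain values at this level form one record, tagged with the shared keys.
        scalars = {k: v for k, v in node.items() if not isinstance(v, (dict, list))}
        if scalars:
            tables[name].append({**keys, **scalars})
        # Nested containers are routed to a table named after their key.
        for k, v in node.items():
            if isinstance(v, (dict, list)):
                split_tables(v, keys, tables, k)


report = {
    "Person": {
        "Name": {"FirstName": "John", "LastName": "Doe"},
        "Accounts": {"Account": [
            {"AccountNumber": 123, "AccountStatus": "G"},
            {"AccountNumber": 137, "AccountStatus": "B"},
            {"AccountNumber": 593, "AccountStatus": "VB"},
        ]},
        "Alerts": {"Alert": [
            {"DT": "20200601", "Msg": "Lorem ipsum"},
            {"DT": "20200615", "Msg": "Dolor sit amet", "Msg2": "Lorem"},
        ]},
    }
}

tables: Dict[str, List[dict]] = defaultdict(list)
split_tables(report, {"id": "123", "date": "20200601"}, tables)  # call once per CSV row
frames = {name: pd.DataFrame(rows) for name, rows in tables.items()}
```

With this input the tables come out named Name, Account and Alert (the innermost key), each carrying the id and date columns, so a new nested element would produce a new table without code changes.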

【Comments】:
