Converting CSV to hierarchical JSON output

【Question title】: Converting CSV to Hierarchical JSON output
【Posted】: 2021-01-01 20:44:35
【Question】:

I am trying to convert a CSV file into a hierarchical JSON file. The CSV input is shown below; it contains two columns, gene and disease.

gene,disease
A1BG,Adenocarcinoma
A1BG,apnea
A1BG,Athritis
A2M,Asthma
A2M,Astrocytoma
A2M,Diabetes
NAT1,polyps
NAT1,lymphoma
NAT1,neoplasms

The expected output should have the following format:

{
     "name": "A1BG",
     "children": [
      {"name": "Adenocarcinoma"},
      {"name": "apnea"},
      {"name": "Athritis"}
      ]
    },

{
     "name": "A2M",
     "children": [
      {"name": "Asthma"},
      {"name": "Astrocytoma"},
      {"name": "Diabetes"}
      ]
    },


{
     "name": "NAT1",
     "children": [
      {"name": "polyps"},
      {"name": "lymphoma"},
      {"name": "neoplasms"}
      ]
    }

The Python code I have written is below. Please let me know what I need to change to get the desired output.

import json
finalList = []
finalDict = {}
grouped = df.groupby(['gene'])

for key, value in grouped:

    dictionary = {}
    dictList = []
    anotherDict = {}

    j = grouped.get_group(key).reset_index(drop=True)
    dictionary['name'] = j.at[0, 'gene']

    for i in j.index:    
        anotherDict['disease'] = j.at[i, 'disease']
        dictList.append(anotherDict)

    dictionary['children'] = dictList
    finalList.append(dictionary)

with open('outputresult3.json', "w") as out:
    json.dump(finalList,out)

【Comments】:

Tags: python json pandas csv dictionary

【Solution 1】:

Use DataFrame.groupby with a custom lambda function that converts each group's values to a list of dictionaries via DataFrame.to_dict:

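import json

# Rename 'disease' to 'name', collect each gene's rows as a list of
# {'name': ...} records, then rename 'gene' to 'name' for the outer level.
L = (df.rename(columns={'disease':'name'})
       .groupby('gene')
       .apply(lambda x: x[['name']].to_dict('records'))
       .reset_index(name='children')
       .rename(columns={'gene':'name'})
       .to_dict('records')
       )
print(L)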
[{'name': 'A1BG', 'children': [{'name': 'Adenocarcinoma'},
                               {'name': 'apnea'}, 
                               {'name': 'Athritis'}]}, 
 {'name': 'A2M', 'children': [{'name': 'Asthma'}, 
                              {'name': 'Astrocytoma'}, 
                              {'name': 'Diabetes'}]}, 
 {'name': 'NAT1', 'children': [{'name': 'polyps'},
                               {'name': 'lymphoma'}, 
                               {'name': 'neoplasms'}]}]

with open('outputresult3.json', "w") as out:
    json.dump(L,out)
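
Note that json.dump writes the list as a single JSON array on one line, whereas the expected output in the question is laid out over several lines. A minimal sketch of one way to get a more readable layout is to pass the indent parameter. The list below is a literal copy of the printed result so that the snippet runs without pandas, and the names pretty and restored are only illustrative:

```python
import json

# Literal copy of the list printed above, so the snippet runs without pandas.
L = [
    {"name": "A1BG", "children": [{"name": "Adenocarcinoma"}, {"name": "apnea"}, {"name": "Athritis"}]},
    {"name": "A2M", "children": [{"name": "Asthma"}, {"name": "Astrocytoma"}, {"name": "Diabetes"}]},
    {"name": "NAT1", "children": [{"name": "polyps"}, {"name": "lymphoma"}, {"name": "neoplasms"}]},
]

# indent=2 spreads the array over several lines; the data itself is unchanged.
pretty = json.dumps(L, indent=2)
print(pretty)

# Round trip: parsing the text gives back the same structure.
restored = json.loads(pretty)
```

The same indent argument can be given to json.dump when writing to a file. The result is still one top-level array, which is what makes the file valid JSON.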

【Comments】:

【Solution 2】:

import json

json_data = []

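# group the data by each unique gene; pass the column name as a string,
# because grouping by the list ["gene"] yields tuple keys in pandas 2.0+
for gene, data in df.groupby("gene"):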

    # obtain a list of diseases for the current gene
    diseases = data["disease"].tolist()

    # create a new list of dictionaries to satisfy json requirements
    children = [{"name": disease} for disease in diseases]
    
    entry = {"name": gene, "children": children}
    json_data.append(entry)
    
with open('outputresult3.json', "w") as out:
    json.dump(json_data, out)
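
For comparison, the loop in the question appears to go wrong in two places: it stores each value under the key 'disease' rather than 'name', and it creates anotherDict once per gene and appends that same object on every iteration, so every child ends up showing the last disease. The sketch below reproduces both patterns on a plain list, without pandas. The names shared, aliased and fresh are only illustrative:

```python
diseases = ["Asthma", "Astrocytoma", "Diabetes"]

# Pattern from the question: one dict created before the loop and reused.
shared = {}
aliased = []
for disease in diseases:
    shared["disease"] = disease   # overwrites the same dict each time
    aliased.append(shared)        # appends another reference to that dict

# Pattern from Solution 2: a new dict per disease, keyed "name".
fresh = [{"name": disease} for disease in diseases]

print(aliased)  # three references to one dict holding only 'Diabetes'
print(fresh)    # three separate dicts, one per disease
```

In the question's code, moving anotherDict = {} inside the inner loop and using the key 'name' should be enough to produce the expected children lists.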

【Comments】: