Convert nested JSON to CSV or table: answers

【Question title】: Convert nested JSON to CSV or table
【Posted】: 2021-11-04 01:28:19
【Question description】:

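I know this question has been asked many times, but none of the answers meets my requirements. I want to convert any nested JSON dynamically into a CSV file or a DataFrame. Some examples follow: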

input : {"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}

Output:

input 2 : {"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}

Output 2:

So far I have tried the code below. It works, but it explodes list-type data into columns, whereas I want lists exploded into rows.

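import pandas as pd


def flatten_json(y):
    """Flatten nested dicts and lists into a single dict with dotted keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '.')
        elif isinstance(x, list):
            # each list element gets its position in the key, so lists spread across columns
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '.')
        else:
            # strip the trailing '.' from the accumulated key
            out[name[:-1]] = str(x)

    flatten(y)
    return out


def start_explode(data):
    # a single dict becomes one row; a list of dicts becomes one row per element
    if isinstance(data, dict):
        df = pd.DataFrame([flatten_json(data)])
    else:
        df = pd.DataFrame([flatten_json(x) for x in data])

    df = df.astype(str)
    return df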

complex_json = {"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
df = start_explode(complex_json['menu'])
print(df)  # display(df) only exists inside a Jupyter/IPython session

For one of the inputs above it gives the following output:

【Question comments】:

Please check How to Ask. What have you tried so far, and what specific problem could you not solve yourself?
Updated the question @buran

Tags: python json pandas dataframe pyspark

【Solution 1】:

Standard techniques for handling nested JSON:
1. json_normalize()
2. explode()
3. apply(pd.Series)
Finish with some cleanup: drop the unwanted rows and replace NaN with an empty string.

import json
import pandas as pd
js = """{"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}"""

df = pd.json_normalize(json.loads(js)).explode("menu.items").reset_index(drop=True)
df.drop(columns=["menu.items"]).join(df["menu.items"].apply(pd.Series)).dropna(subset=["id"]).fillna("")

	menu.header	id	label
0	SVG Viewer	Open
1	SVG Viewer	OpenNew	Open New
3	SVG Viewer	ZoomIn	Zoom In
4	SVG Viewer	ZoomOut	Zoom Out
5	SVG Viewer	OriginalView	Original View
7	SVG Viewer	Quality
8	SVG Viewer	Pause
9	SVG Viewer	Mute
11	SVG Viewer	Find	Find...
12	SVG Viewer	FindAgain	Find Again
13	SVG Viewer	Copy
14	SVG Viewer	CopyAgain	Copy Again
15	SVG Viewer	CopySVG	Copy SVG
16	SVG Viewer	ViewSVG	View SVG
17	SVG Viewer	ViewSource	View Source
18	SVG Viewer	SaveAs	Save As
20	SVG Viewer	Help
21	SVG Viewer	About	About Adobe CVG Viewer...
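
To make the chained one-liner easier to follow, here is a minimal sketch that runs the same three steps one at a time. The input is a small made-up document (not the one from the question) and the intermediate names `step1`, `step2` and `expanded` are only for illustration.

```python
import pandas as pd

# Small hypothetical input, shaped like the question's first example
data = {"menu": {"header": "Demo", "items": [{"id": "A", "label": "Alpha"}, None, {"id": "B"}]}}

# 1. json_normalize() flattens nested dicts into dotted column names;
#    the list is left untouched inside a single cell
step1 = pd.json_normalize(data)

# 2. explode() gives every list element its own row
step2 = step1.explode("menu.items").reset_index(drop=True)

# 3. apply(pd.Series) turns each dict into columns (the null entry yields an empty row)
expanded = step2["menu.items"].apply(pd.Series)

# cleanup: drop the row that came from the null entry, blank out missing labels
result = (
    step2.drop(columns=["menu.items"])
    .join(expanded)
    .dropna(subset=["id"])
    .fillna("")
)
print(result)
```

Note that the row index keeps a gap where the null entry was dropped, just as in the output above; add another `reset_index(drop=True)` if that matters.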

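Utility function

Use this if you don't want to name the column and would rather take the first list column:
1. identify the first column that contains lists
2. explode() and apply(pd.Series) that column
3. an option to expand all lists is provided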

def normalize(js, expand_all=False):
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)

    def is_list(frame):
        # cell-wise mask; column-wise map avoids applymap, which is deprecated in recent pandas
        return frame.apply(lambda s: s.map(lambda v: isinstance(v, list)))

    # get first column that contains lists
    col = is_list(df).all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any lists left? if so, run one more pass over the remaining list column
    if expand_all and is_list(df).any(axis=1).all():
        df = normalize(df.to_dict("records"))
    return df

js = """{ "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" } ] }"""

normalize(js, expand_all=True)

	id	type	name	ppu	id.topping	type.topping	id.batters.batter	type.batters.batter
0	0001	donut	Cake	0.55	5001	None	1001	Regular
1	0001	donut	Cake	0.55	5001	None	1002	Chocolate
2	0001	donut	Cake	0.55	5001	None	1003	Blueberry
3	0001	donut	Cake	0.55	5001	None	1004	Devil's Food
4	0001	donut	Cake	0.55	5002	Glazed	1001	Regular
5	0001	donut	Cake	0.55	5002	Glazed	1002	Chocolate
6	0001	donut	Cake	0.55	5002	Glazed	1003	Blueberry
7	0001	donut	Cake	0.55	5002	Glazed	1004	Devil's Food
8	0001	donut	Cake	0.55	5005	Sugar	1001	Regular
9	0001	donut	Cake	0.55	5005	Sugar	1002	Chocolate
10	0001	donut	Cake	0.55	5005	Sugar	1003	Blueberry
11	0001	donut	Cake	0.55	5005	Sugar	1004	Devil's Food
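
Since the question asks for a CSV file, here is a short sketch of the last step. The small `df` below is a hypothetical stand-in for whatever `normalize()` returns, and `output.csv` is a placeholder path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the DataFrame returned by normalize()
df = pd.DataFrame(
    {"id": ["0001", "0001"], "type": ["donut", "donut"], "id.topping": ["5001", "5002"]}
)

# index=False leaves the row index out of the CSV
csv_text = df.to_csv(index=False)
print(csv_text)

# Writing straight to a file works the same way ("output.csv" is a placeholder):
# df.to_csv("output.csv", index=False)

# When reading back, dtype=str keeps ids such as "0001" from being parsed as integers
round_trip = pd.read_csv(io.StringIO(csv_text), dtype=str)
```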

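Treat each list independently

This replicates the way https://data.page/json/csv works.
It is a limited use case and does not follow general data modelling principles.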

def n2(js):
    df = pd.json_normalize(json.loads(js))
    # columns that contain lists (Series.iteritems() was removed in pandas 2.0; items() is the replacement)
    is_list = df.apply(lambda s: s.map(lambda v: isinstance(v, list)))
    cols = [i for i, c in is_list.all().items() if c]
    # use list from first row
    return pd.concat(
        [df.drop(columns=cols)]
        + [pd.json_normalize(df.loc[0, c]).pipe(lambda d: d.rename(columns={c2: f"{c}.{c2}" for c2 in d.columns}))
            for c in cols],
        axis=1,
    ).fillna("")
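
If lists can turn up at any depth, one possible direction is to keep exploding and expanding until no list or dict cells remain. This is only a sketch under assumptions, not a tested general solution: `flatten_to_rows` is a hypothetical name, it assumes the input is a dict or a list of dicts, and a column that mixes dicts with plain scalars would lose the scalars.

```python
import pandas as pd


def flatten_to_rows(data, sep="."):
    """Hypothetical helper: lists become rows, dicts become columns, at any depth."""
    df = pd.json_normalize(data, sep=sep)
    while True:
        list_cols = [c for c in df.columns if df[c].map(lambda v: isinstance(v, list)).any()]
        dict_cols = [c for c in df.columns if df[c].map(lambda v: isinstance(v, dict)).any()]
        if list_cols:
            col = list_cols[0]
            # wrap stray dicts so explode() cannot split them into their keys
            df[col] = df[col].map(lambda v: [v] if isinstance(v, dict) else v)
            df = df.explode(col).reset_index(drop=True)
        elif dict_cols:
            col = dict_cols[0]
            # non-dict cells (e.g. null list entries) become empty records
            records = [v if isinstance(v, dict) else {} for v in df[col]]
            expanded = pd.json_normalize(records, sep=sep).add_prefix(f"{col}{sep}")
            expanded.index = df.index
            df = df.drop(columns=[col]).join(expanded)
        else:
            return df


doc = {"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
    {"value": "New", "onclick": "CreateNewDoc()"},
    {"value": "Open", "onclick": "OpenDoc()"},
    {"value": "Close", "onclick": "CloseDoc()"},
]}}}
table = flatten_to_rows(doc)
print(table)
```

Independent lists at the same level still end up as a cross product, as with `normalize(js, expand_all=True)`, and null list entries survive as rows of NaN that you can drop afterwards.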

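【Comments】:

Hey @rob, thanks for your help! But this way it is static: for every different JSON I have to supply a different configuration before it works. If you look at what I tried, it does the job but explodes horizontally, while I want it to explode vertically; there I just pass in the JSON and it does everything recursively. Is there any chance we can make this dynamic? I ask because I don't know at which level a list might turn up again.
Updated the answer. You don't want to name the column, but inferring the column can be done with the utility function provided. I don't see any generic way to drop the unwanted rows and replace NaN with an empty string.
Hey man!! You're a genius, let me test a few more samples.
Minor update to the answer: join() needs to produce unique column names. For the example above, which has two embedded lists, you can expand both by calling it twice: normalize(normalize(js).to_dict("records"))
Out of interest I've coded this up and added it to the answer. I think it is a very dangerous and restrictive approach to data management.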

【Solution 2】:

You can try json_normalize

import pandas as pd
import json

data = json.loads("""{"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}""")

# remove null
data['menu']['items'] = [i for i in data['menu']['items'] if i is not None]

pd.json_normalize(data['menu'], record_path=['items'], meta=['header'], record_prefix='items_')

#   items_id    items_label
# header        
# SVG Viewer    Open    NaN
# SVG Viewer    OpenNew Open New
# SVG Viewer    ZoomIn  Zoom In
# SVG Viewer    ZoomOut Zoom Out
# SVG Viewer    OriginalView    Original View
# SVG Viewer    Quality NaN
# SVG Viewer    Pause   NaN
# SVG Viewer    Mute    NaN
# SVG Viewer    Find    Find...
# SVG Viewer    FindAgain   Find Again
# SVG Viewer    Copy    NaN
# SVG Viewer    CopyAgain   Copy Again
# SVG Viewer    CopySVG Copy SVG
# SVG Viewer    ViewSVG View SVG
# SVG Viewer    ViewSource  View Source
# SVG Viewer    SaveAs  Save As
# SVG Viewer    Help    NaN
# SVG Viewer    About   About Adobe CVG Viewer...
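
The `# remove null` step above is tied to `data['menu']['items']`. If the location of the lists isn't known in advance, the same filtering could be applied recursively. A minimal sketch, where `drop_nulls` is a hypothetical helper name:

```python
def drop_nulls(node):
    """Hypothetical helper: return a copy of node with None removed from every list."""
    if isinstance(node, dict):
        return {key: drop_nulls(value) for key, value in node.items()}
    if isinstance(node, list):
        return [drop_nulls(item) for item in node if item is not None]
    return node


# Small made-up document with a null list entry
data = {"menu": {"header": "SVG Viewer", "items": [{"id": "Open"}, None, {"id": "Help"}]}}
cleaned = drop_nulls(data)
print(cleaned["menu"]["items"])
```

Only null list entries are removed; a dict value that is None is kept, and the original `data` is left unchanged.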

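【Comments】:

Thanks @Epsi95, I really appreciate your help! But this way it is static: for every different JSON I have to supply a different configuration before it works. If you look at what I tried, it does the job but explodes horizontally, while I want it to explode vertically; there I just pass in the JSON and it does everything recursively.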

【Solution 3】:

In Python you can do this with pandas, but it repeats the header value on every row, as shown below

code

output

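【Comments】:

Thanks @rao-yasir, but I want it exploded at every level, and your output still has dict values.
@mayankgupta Yes, that would need a custom parser. Let me give it a try; I'll post it if I succeed.
@rao-yasir Welcome to SO. Please post your code and output as (formatted) text, not as image links. Links can break, and images can't be copied and pasted into an editor.
@gimix I certainly will.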