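【Question title】: Convert nested JSON to CSV or table
【Posted】: 2021-11-04 01:28:19
【Question description】:

I know this question has been asked many times, but none of the answers meets my requirements. I want to dynamically convert any nested JSON into a CSV file or a DataFrame. Some examples follow: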

input : {"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}

Output:

input 2 : {"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}

Output 2:

So far I have tried the code below. It works, but it explodes list-type data into columns, whereas I want lists to be exploded into rows.

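import pandas as pd


def flatten_json(y):
    """Flatten nested dicts and lists into a single dict keyed by dotted paths."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '.')
        elif isinstance(x, list):
            # list positions become part of the key, which is why lists end up as columns
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '.')
        else:
            # strip the trailing '.' from the accumulated path
            out[name[:-1]] = str(x)

    flatten(y)
    return out


def start_explode(data):
    # a single object becomes one row; a list of objects becomes one row per element
    if isinstance(data, dict):
        df = pd.DataFrame([flatten_json(data)])
    else:
        df = pd.DataFrame([flatten_json(x) for x in data])
    return df.astype(str)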

complex_json = {"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
df = start_explode(complex_json['menu'])
print(df)  # or display(df) inside a Jupyter notebook

For one of the inputs above it gives the following output:
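
To make the horizontal layout concrete, here is a condensed sketch of the question's flatten_json applied to the second input. The variable name `menu` and the final print calls are assumptions added for illustration; the point is that every list position turns into its own set of columns, so the whole document collapses into a single wide row.

```python
import pandas as pd


def flatten_json(y):
    """Flatten nested dicts/lists into one dict keyed by dotted paths."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, f"{name}{key}.")
        elif isinstance(x, list):
            # the list index is baked into the column name
            for i, value in enumerate(x):
                flatten(value, f"{name}{i}.")
        else:
            out[name[:-1]] = str(x)

    flatten(y)
    return out


menu = {
    "id": "file",
    "value": "File",
    "popup": {"menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"},
    ]},
}

df = pd.DataFrame([flatten_json(menu)])
print(df.shape)          # one row, one column per leaf value
print(list(df.columns))  # e.g. 'popup.menuitem.0.value', 'popup.menuitem.1.value', ...
```

The desired result would instead have three rows (one per menuitem) with the scalar fields id and value repeated on each row.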

【Question discussion】:

  • Please review How to Ask. What have you tried so far, and what specific problem can't you solve yourself?
  • Updated the question, @buran

Tags: python json pandas dataframe pyspark


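【Solution 1】:
  • Standard techniques for handling nested JSON
    1. json_normalize()
    2. explode()
    3. apply(pd.Series)
  • Finally, some cleanup: drop the unwanted rows and replace NaN with empty strings
import json
import pandas as pd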
js = """{"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}"""

df = pd.json_normalize(json.loads(js)).explode("menu.items").reset_index(drop=True)
df.drop(columns=["menu.items"]).join(df["menu.items"].apply(pd.Series)).dropna(subset=["id"]).fillna("")

    menu.header  id            label
0   SVG Viewer   Open
1   SVG Viewer   OpenNew       Open New
3   SVG Viewer   ZoomIn        Zoom In
4   SVG Viewer   ZoomOut       Zoom Out
5   SVG Viewer   OriginalView  Original View
7   SVG Viewer   Quality
8   SVG Viewer   Pause
9   SVG Viewer   Mute
11  SVG Viewer   Find          Find...
12  SVG Viewer   FindAgain     Find Again
13  SVG Viewer   Copy
14  SVG Viewer   CopyAgain     Copy Again
15  SVG Viewer   CopySVG       Copy SVG
16  SVG Viewer   ViewSVG       View SVG
17  SVG Viewer   ViewSource    View Source
18  SVG Viewer   SaveAs        Save As
20  SVG Viewer   Help
21  SVG Viewer   About         About Adobe CVG Viewer...
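
Since the question also asks for CSV, the flattened frame can be written out with DataFrame.to_csv. The sketch below repeats the explode/apply(pd.Series) steps on a shortened version of the input (the shortened JSON, the explicit column selection, and the in-memory buffer are assumptions for illustration; pass a file path to to_csv to write to disk instead).

```python
import io
import json

import pandas as pd

js = """{"menu": {"header": "SVG Viewer",
                  "items": [{"id": "Open"},
                            {"id": "OpenNew", "label": "Open New"}]}}"""

# one row per list element, then expand each embedded dict into columns
df = pd.json_normalize(json.loads(js)).explode("menu.items").reset_index(drop=True)
flat = (
    df.drop(columns=["menu.items"])
    .join(df["menu.items"].apply(pd.Series))
    .fillna("")
)
# fix the column order explicitly so the CSV header is predictable
flat = flat[["menu.header", "id", "label"]]

buffer = io.StringIO()
flat.to_csv(buffer, index=False)  # index=False keeps the row labels out of the file
csv_text = buffer.getvalue()
print(csv_text)
```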

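Utility function

  • Use this if you do not want to name the column and would rather take the first list column
  • Identify the first column that contains lists
  • Apply explode() and apply(pd.Series) to that column
  • Provides an option to expand all lists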
def normalize(js, expand_all=False):
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)
    # boolean frame: True where a cell holds a list
    is_list = df.apply(lambda s: s.map(lambda v: isinstance(v, list)))
    # get first column that contains lists in every row
    col = is_list.all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any lists left? the recursive call expands one more list column
    still_lists = df.apply(lambda s: s.map(lambda v: isinstance(v, list)))
    if expand_all and still_lists.any(axis=1).all():
        df = normalize(df.to_dict("records"))
    return df

js = """{ "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" } ] }"""

normalize(js, expand_all=True)

    id    type   name  ppu   id.topping  type.topping  id.batters.batter  type.batters.batter
0   0001  donut  Cake  0.55  5001        None          1001               Regular
1   0001  donut  Cake  0.55  5001        None          1002               Chocolate
2   0001  donut  Cake  0.55  5001        None          1003               Blueberry
3   0001  donut  Cake  0.55  5001        None          1004               Devil's Food
4   0001  donut  Cake  0.55  5002        Glazed        1001               Regular
5   0001  donut  Cake  0.55  5002        Glazed        1002               Chocolate
6   0001  donut  Cake  0.55  5002        Glazed        1003               Blueberry
7   0001  donut  Cake  0.55  5002        Glazed        1004               Devil's Food
8   0001  donut  Cake  0.55  5005        Sugar         1001               Regular
9   0001  donut  Cake  0.55  5005        Sugar         1002               Chocolate
10  0001  donut  Cake  0.55  5005        Sugar         1003               Blueberry
11  0001  donut  Cake  0.55  5005        Sugar         1004               Devil's Food

Treat each list independently

def n2(js):
    df = pd.json_normalize(json.loads(js))
    # columns in which every row holds a list
    # (Series.iteritems() was removed in pandas 2.0, so test each column directly)
    cols = [c for c in df.columns if df[c].map(lambda v: isinstance(v, list)).all()]
    # use list from first row; prefix each expanded column with its source column name
    return pd.concat(
        [df.drop(columns=cols)]
        + [pd.json_normalize(df.loc[0, c]).add_prefix(f"{c}.") for c in cols],
        axis=1,
    ).fillna("")
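
As a usage sketch for the independent-list variant, the block below runs n2 on a shortened donut document (the shortened JSON and the variable name `out` are assumptions for illustration). Each list is laid out side by side rather than cross-joined, so the row count equals the length of the longest list and shorter lists are padded with empty strings.

```python
import json

import pandas as pd


def n2(js):
    df = pd.json_normalize(json.loads(js))
    # columns in which every row holds a list
    cols = [c for c in df.columns if df[c].map(lambda v: isinstance(v, list)).all()]
    # expand each list from the first row into its own prefixed block of columns
    return pd.concat(
        [df.drop(columns=cols)]
        + [pd.json_normalize(df.loc[0, c]).add_prefix(f"{c}.") for c in cols],
        axis=1,
    ).fillna("")


js = """{"id": "0001", "type": "donut",
         "batters": {"batter": [{"id": "1001", "type": "Regular"},
                                {"id": "1002", "type": "Chocolate"}]},
         "topping": [{"id": "5001", "type": "None"},
                     {"id": "5002", "type": "Glazed"},
                     {"id": "5005", "type": "Sugar"}]}"""

out = n2(js)
print(out)
```

Note that the scalar fields (id, type) appear only on the first row here, unlike the exploded version where they repeat on every row.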

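【Discussion】:

  • Hey @rob, thanks for your help! But this way it is static: for every different JSON I have to supply a different configuration before it works. If you look at what I tried, it does the job but explodes horizontally, whereas I want it to explode vertically; there I just pass in the JSON and it does everything recursively. So is there any chance we can make it dynamic? The reason I ask is that I don't know at which level it might run into a list again.
  • Updated answer - you don't want to name the column, but inferring the column can be done with the utility function provided. I don't see any generic way to drop the unwanted rows and replace NaN with empty strings.
  • Hey man!! You're a genius, let me test a few more samples..
  • Minor update to the answer - need to make sure join() generates unique column names. For the example above, with two embedded lists, you can expand by calling it twice: normalize(normalize(js).to_dict("records"))
  • Out of interest, I have coded this and added it to the answer. I consider this a very dangerous and restrictive approach to data management.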
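
For the fully dynamic case raised in this discussion (lists at unknown depth, exploded into rows), one possible approach is a small recursive walker in plain Python that builds rows before handing them to pandas. This is only a sketch, not code from the answer: the function name to_rows is hypothetical, nulls inside lists are skipped, and sibling lists are cross-joined, which can produce many rows for wide documents.

```python
import pandas as pd


def to_rows(node, prefix=""):
    """Return a list of flat dicts (rows) for an arbitrarily nested JSON value."""
    if isinstance(node, dict):
        rows = [{}]
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            child_rows = to_rows(value, path)
            # combine every row built so far with every row of this child
            rows = [{**left, **right} for left in rows for right in child_rows]
        return rows
    if isinstance(node, list):
        rows = []
        for item in node:
            if item is not None:  # skip JSON nulls inside lists
                rows.extend(to_rows(item, prefix))
        return rows or [{}]  # an empty list still keeps the parent row
    return [{prefix: node}]  # scalar leaf


data = {"menu": {
    "id": "file",
    "value": "File",
    "popup": {"menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"},
    ]},
}}

df = pd.DataFrame(to_rows(data))
print(df)
```

No column names need to be supplied: each list element becomes its own row and the scalar fields are repeated alongside it. Calling df.to_csv(...) on the result would then produce the CSV.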
【Solution 2】:

You can try json_normalize

import pandas as pd
import json

data = json.loads("""{"menu": {
    "header": "SVG Viewer",
    "items": [
        {"id": "Open"},
        {"id": "OpenNew", "label": "Open New"},
        null,
        {"id": "ZoomIn", "label": "Zoom In"},
        {"id": "ZoomOut", "label": "Zoom Out"},
        {"id": "OriginalView", "label": "Original View"},
        null,
        {"id": "Quality"},
        {"id": "Pause"},
        {"id": "Mute"},
        null,
        {"id": "Find", "label": "Find..."},
        {"id": "FindAgain", "label": "Find Again"},
        {"id": "Copy"},
        {"id": "CopyAgain", "label": "Copy Again"},
        {"id": "CopySVG", "label": "Copy SVG"},
        {"id": "ViewSVG", "label": "View SVG"},
        {"id": "ViewSource", "label": "View Source"},
        {"id": "SaveAs", "label": "Save As"},
        null,
        {"id": "Help"},
        {"id": "About", "label": "About Adobe CVG Viewer..."}
    ]
}}""")

# remove null
data['menu']['items'] = [i for i in data['menu']['items'] if i is not None]

pd.json_normalize(data['menu'], record_path=['items'], meta=['header'], record_prefix='items_')

#   items_id    items_label
# header        
# SVG Viewer    Open    NaN
# SVG Viewer    OpenNew Open New
# SVG Viewer    ZoomIn  Zoom In
# SVG Viewer    ZoomOut Zoom Out
# SVG Viewer    OriginalView    Original View
# SVG Viewer    Quality NaN
# SVG Viewer    Pause   NaN
# SVG Viewer    Mute    NaN
# SVG Viewer    Find    Find...
# SVG Viewer    FindAgain   Find Again
# SVG Viewer    Copy    NaN
# SVG Viewer    CopyAgain   Copy Again
# SVG Viewer    CopySVG Copy SVG
# SVG Viewer    ViewSVG View SVG
# SVG Viewer    ViewSource  View Source
# SVG Viewer    SaveAs  Save As
# SVG Viewer    Help    NaN
# SVG Viewer    About   About Adobe CVG Viewer...
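
The same json_normalize pattern can be applied to the second input from the question, where the list sits one level deeper. In this sketch record_path walks through popup to menuitem, and a record_prefix is used because both the records and the parent carry a field called value (the prefix string "menuitem." is an arbitrary choice for illustration).

```python
import pandas as pd

data = {"menu": {
    "id": "file",
    "value": "File",
    "popup": {"menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"},
    ]},
}}

df = pd.json_normalize(
    data["menu"],
    record_path=["popup", "menuitem"],  # path to the list that becomes the rows
    meta=["id", "value"],               # parent fields repeated on every row
    record_prefix="menuitem.",          # avoids a name clash on 'value'
)
print(df)
```

As the comment below points out, the path and meta fields still have to be named per document, so this stays a static configuration.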

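【Discussion】:

  • Thanks @Epsi95, I really appreciate the help! But this way it is static: for every different JSON I have to supply a different configuration before it works. If you look at what I tried, it does the job but explodes horizontally, whereas I want it to explode vertically; there I just pass in the JSON and it does everything recursively.
【Solution 3】:

In Python you can use pandas to do this, but it repeats the header value on every row, as shown below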


code

output

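【Discussion】:

  • Thanks @rao-yasir, but I want it exploded at every level, and your output still contains dict values.
  • @mayankgupta Yes, that needs a custom parser. Let me try, and I will post it if I succeed.
  • @rao-yasir Welcome to SO. Please post your code and output as (formatted) text rather than as image links. Links can break, and images cannot be copied and pasted into an editor.
  • @gimix I certainly will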