How do I create a Pandas DataFrame from a dictionary containing a nested dictionary? Answers

【Question title】: How do I create a Pandas DataFrame from a dictionary containing a nested dictionary?
【Posted】: 2021-12-08 15:23:16
【Question description】:

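I'm working on a project where I fetch JSON data from a GraphQL API. After receiving the data, I call json.loads() on it, access the part of the JSON I need, and store it in a dictionary that contains another dictionary. The dictionaries are: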

{'placement': 1, 'entrant': {'id': 8554498, 'name': 'Test'}}
{'placement': 2, 'entrant': {'id': 8559863, 'name': 'Test'}}
{'placement': 3, 'entrant': {'id': 8561463, 'name': 'Test'}}
{'placement': 4, 'entrant': {'id': 8559889, 'name': 'Test'}}
{'placement': 5, 'entrant': {'id': 8561608, 'name': 'Test'}}
{'placement': 5, 'entrant': {'id': 8560090, 'name': 'Test'}}
{'placement': 7, 'entrant': {'id': 8561639, 'name': 'Test'}}
{'placement': 7, 'entrant': {'id': 8561822, 'name': 'Test'}}
{'placement': 9, 'entrant': {'id': 8559993, 'name': 'Test'}}
{'placement': 9, 'entrant': {'id': 8561572, 'name': 'Test'}}

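How can I create a Pandas DataFrame whose columns are

placement |  id  |  name

and whose values under those columns are the values associated with them in the dictionaries? If I just use

pd.DataFrame()

the output is not what I expect, so I looked for solutions that iterate over the items in the dictionary, but without success. Any help would be appreciated. Thank you.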
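
For context, here is a minimal sketch of what probably goes wrong when one of these dictionaries is passed straight to pd.DataFrame(). The question does not show the exact call, so the variable name record and the single-dictionary input are assumptions. The keys of the inner dictionary end up as the row index instead of becoming columns:

```python
import pandas as pd

# One record, shaped like the dictionaries in the question (assumed input).
record = {"placement": 1, "entrant": {"id": 8554498, "name": "Test"}}

# The inner dictionary's keys ("id", "name") become the row index and the
# scalar "placement" value is repeated on every row.
df = pd.DataFrame(record)
print(df)
```

The result has two rows (id and name) and two columns (placement and entrant), which is why the nested dictionary has to be flattened first.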

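【Comments】:

Is the shared data contained in a list? Please provide a complete example, along with the expected output.
I was going to ask the same thing: are there brackets or commas between the rows, and what are the type and length of df_data1?
It is not contained in a list. It is exactly as shown in my post. The type of df_data1, according to type(), is a .

Tags: python json pandas dataframe dictionary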

【Solution 1】:

Here is one way to do it: extract a new DataFrame from the "entrant" column of the first DataFrame and merge the two on the index:

from itertools import chain

import pandas as pd

data = [
    [{"placement": 1, "entrant": {"id": 8554498, "name": "Test"}}],
    [{"placement": 2, "entrant": {"id": 8559863, "name": "Test"}}],
    [{"placement": 3, "entrant": {"id": 8561463, "name": "Test"}}],
    [{"placement": 4, "entrant": {"id": 8559889, "name": "Test"}}],
    [{"placement": 5, "entrant": {"id": 8561608, "name": "Test"}}],
    [{"placement": 5, "entrant": {"id": 8560090, "name": "Test"}}],
    [{"placement": 7, "entrant": {"id": 8561639, "name": "Test"}}],
    [{"placement": 7, "entrant": {"id": 8561822, "name": "Test"}}],
    [{"placement": 9, "entrant": {"id": 8559993, "name": "Test"}}],
    [{"placement": 9, "entrant": {"id": 8561572, "name": "Test"}}],
]

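# Flatten the one-element sublists into a single list of dictionaries
df = pd.DataFrame(list(chain.from_iterable(data)))
# Expand the "entrant" dictionaries into their own columns and join on the index
result_df = pd.merge(
    df.loc[:, df.columns != "entrant"],  # df without the "entrant" column
    df["entrant"].apply(pd.Series),
    left_index=True,
    right_index=True,
)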

This gives the following result:

   placement       id  name
0          1  8554498  Test
1          2  8559863  Test
2          3  8561463  Test
3          4  8559889  Test
4          5  8561608  Test
5          5  8560090  Test
6          7  8561639  Test
7          7  8561822  Test
8          9  8559993  Test
9          9  8561572  Test
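
As an alternative that none of the answers here uses, pandas also ships pd.json_normalize, which flattens nested dictionaries directly. The sketch below assumes the records are already collected in a list (here called records, a name chosen for illustration) and shows only three of the rows:

```python
import pandas as pd

# A shortened copy of the data from the question, assumed to be in a list.
records = [
    {"placement": 1, "entrant": {"id": 8554498, "name": "Test"}},
    {"placement": 2, "entrant": {"id": 8559863, "name": "Test"}},
    {"placement": 3, "entrant": {"id": 8561463, "name": "Test"}},
]

# json_normalize flattens nested dictionaries into dotted column names
# such as "entrant.id" and "entrant.name".
flat = pd.json_normalize(records)

# Keep only the last part of each dotted name to get the requested columns.
flat = flat.rename(columns=lambda c: c.split(".")[-1])
print(flat)
```

Dropping the prefix is only safe when the inner keys do not collide with top-level keys, which holds for this data.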

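【Discussion】:

This works when I test it with your data, but not with mine, because my df_data1 variable isn't stored in one overall list. When I try to store it in a list, it ends up like this: pastebin.com/Bbe9jDjs. It treats the df_data1 dictionary as just a single entry in the list.
@binarycoffee356 I see. I've edited my answer to use itertools (a standard-library module) so that it supports nested lists.
I'm getting a KeyError on 'entrant'. I suspect that's because your data is an array of comma-separated dictionaries, whereas mine is just a list containing one entry.
Actually, it works perfectly even without the edit. It was just a small mistake on my side. Thank you very much for your help.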

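【Solution 2】:

You need to build dictionaries in a shape that pandas can turn into a DataFrame. I'm assuming here that you have a list of dictionaries called dictionaries.

import pandas as pd

pd.DataFrame(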
    [
        {"placement": d["placement"], "id": d["entrant"]["id"], "name": d["entrant"]["name"]}
        for d in dictionaries
    ]
)

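【Discussion】:

Well, I don't have a list of dictionaries. I create and store the dictionary inside a loop like this: df_data1 = json_data['data']['event']['standings']['nodes'][j], where json_data is the variable holding the JSON data and df_data1 becomes the dictionary in my post. With your solution I get an error saying that string indices must be integers.
So, are all the rows one single dictionary containing dictionaries, or is each row a separate dictionary?
Well, it's been a while since I've written Python, so my answer may not be entirely correct, but judging from the dictionaries I posted, it looks like a dictionary that contains a unique dictionary for each row. The dictionaries I posted are exactly what I see when I print the contents. If I print the dictionary's items, I can see they are 'placement' and 'entrant', if that helps clarify its structure.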
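
The "string indices must be integers" error is what Python raises when the comprehension iterates over a single dictionary: the loop variable is then a key (a string), and indexing a string with "placement" fails. A possible fix, sketched under the assumption that the 'nodes' entry quoted above is itself a list of the dictionaries shown in the question, is to pass that list to the comprehension instead of indexing it with [j] in a loop. The payload below is made up to mirror the quoted path:

```python
import json

import pandas as pd

# Hypothetical payload that mirrors the path quoted in the comment above.
raw = json.dumps({"data": {"event": {"standings": {"nodes": [
    {"placement": 1, "entrant": {"id": 8554498, "name": "Test"}},
    {"placement": 2, "entrant": {"id": 8559863, "name": "Test"}},
]}}}})

json_data = json.loads(raw)

# Take the whole array at once; it is already a list of dictionaries.
nodes = json_data["data"]["event"]["standings"]["nodes"]

df = pd.DataFrame(
    [
        {"placement": d["placement"], "id": d["entrant"]["id"], "name": d["entrant"]["name"]}
        for d in nodes
    ]
)
print(df)
```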

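【Solution 3】:

I would do something like this. It may not be the most elegant solution, but it works. I'm assuming you have a list of the dictionaries, since you showed them one by one.

import pandas as pd

dList = [{'placement': 1, 'entrant': {'id': 8554498, 'name': 'Test'}},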
{'placement': 2, 'entrant': {'id': 8559863, 'name': 'Test'}},
{'placement': 3, 'entrant': {'id': 8561463, 'name': 'Test'}},
{'placement': 4, 'entrant': {'id': 8559889, 'name': 'Test'}},
{'placement': 5, 'entrant': {'id': 8561608, 'name': 'Test'}},
{'placement': 5, 'entrant': {'id': 8560090, 'name': 'Test'}},
{'placement': 7, 'entrant': {'id': 8561639, 'name': 'Test'}},
{'placement': 7, 'entrant': {'id': 8561822, 'name': 'Test'}},
{'placement': 9, 'entrant': {'id': 8559993, 'name': 'Test'}},
{'placement': 9, 'entrant': {'id': 8561572, 'name': 'Test'}}]


# Generate the column names from the first dictionary (I assume they are not hard-coded, to keep the solution general)
d0 = dList[0]
columns = []
for key,val in d0.items():
    if not isinstance(val,dict):
        columns.append(key)
    else:
        for subkey,subval in val.items():
            columns.append(subkey)

#%% Build the data list (one sublist of values per dictionary)
data = []
for d in dList:
    thisData = []
    for key,val in d.items():
        if not isinstance(val,dict):
            thisData.append(val)
        else:
            for subkey,subval in val.items():
                thisData.append(subval)
    data.append(thisData)


df = pd.DataFrame(data,columns=columns)

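I hope this works for you; if not, let me know.

【Discussion】:

What if my dictionaries aren't stored in a list? They are stored in the variable df_data1, which is assigned inside a loop like this: df_data1 = json_data1['data']['event']['standings']['nodes'][j].
Sorry, I don't quite follow. Do you have multiple dictionaries, named df_data1, df_data2, and so on?
No, sorry for the confusion. I'm looping because the "nodes" part of the JSON is an array of length n. I loop over it to access the data of every node, but I store it in a single variable, df_data1, which then becomes the dictionary in my original post.
So, if you have a big dictionary that contains dictionaries (each of which contains one more dictionary), let's call them orig_dict (the dictionary of dictionaries), placeDict ({'placement': 1, 'entrant': {'id': 8554498, 'name': 'Test'}}) and subdict (entrant). Then replace for d in dList: with for d in orig_dict.values(): and this might work. Let me know.
This is a decent solution, but performance-wise it will be quite poor because of the Python loops. As a general rule of thumb for pandas and other data-science tools: the less pure-Python processing you have to do, the better.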
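
Following up on the performance remark, one way to avoid the explicit loops is to let pandas build both frames and join them on the index. This is only a sketch with a shortened dList; it still converts Python objects, but it is typically faster than row-by-row processing:

```python
import pandas as pd

# Shortened version of dList from the answer above.
dList = [
    {"placement": 1, "entrant": {"id": 8554498, "name": "Test"}},
    {"placement": 2, "entrant": {"id": 8559863, "name": "Test"}},
    {"placement": 3, "entrant": {"id": 8561463, "name": "Test"}},
]

df = pd.DataFrame(dList)

# pop() removes the "entrant" column and returns it as a Series of dicts;
# building a DataFrame from that list expands each dict into columns.
entrant = pd.DataFrame(df.pop("entrant").tolist(), index=df.index)

# Join the expanded columns back onto the remaining ones by index.
result = df.join(entrant)
print(result)
```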

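【Solution 4】:

Because of the way json.loads() delivers the data, you cannot iterate over df_data1 to capture all the dictionaries. To give the data the structure you need, I suggest adding commas between the dictionaries by replacing every occurrence of "}{" with "}, {", and wrapping the result in "[" and "]". Assuming j is your JSON string:

import json

df_data1 = json.loads("[" + j.replace("}{", "}, {") + "]")

Your df_data1 should now look like this: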

[{'placement': 1, 'entrant': {'id': 8554498, 'name': 'Test'}},
{'placement': 2, 'entrant': {'id': 8559863, 'name': 'Test'}},
{'placement': 3, 'entrant': {'id': 8561463, 'name': 'Test'}},
{'placement': 4, 'entrant': {'id': 8559889, 'name': 'Test'}},
{'placement': 5, 'entrant': {'id': 8561608, 'name': 'Test'}},
{'placement': 5, 'entrant': {'id': 8560090, 'name': 'Test'}},
{'placement': 7, 'entrant': {'id': 8561639, 'name': 'Test'}},
{'placement': 7, 'entrant': {'id': 8561822, 'name': 'Test'}},
{'placement': 9, 'entrant': {'id': 8559993, 'name': 'Test'}},
{'placement': 9, 'entrant': {'id': 8561572, 'name': 'Test'}}]

You can now use @Thomas Q's solution:

import pandas as pd

df = pd.DataFrame(
    [
        {"placement": d["placement"], "id": d["entrant"]["id"], "name": d["entrant"]["name"]}
        for d in df_data1
    ]
)
df

   placement       id  name
0          1  8554498  Test
1          2  8559863  Test
2          3  8561463  Test
3          4  8559889  Test
4          5  8561608  Test
5          5  8560090  Test
6          7  8561639  Test
7          7  8561822  Test
8          9  8559993  Test
9          9  8561572  Test
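
The string replacement above breaks if "}{" ever appears inside a string value, or if the objects are separated by newlines rather than nothing. If the input really is a run of concatenated JSON objects (an assumption; the thread never shows the raw text), a sketch based on the standard library's json.JSONDecoder.raw_decode avoids both problems. The helper name parse_concatenated and the sample text are illustrative only:

```python
import json


def parse_concatenated(text):
    """Parse back-to-back JSON values from one string into a list."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return items
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)


# Illustrative input: two objects with no comma between them, one of which
# contains "}{" inside a string value.
sample = (
    '{"placement": 1, "entrant": {"id": 8554498, "name": "Te}{st"}}\n'
    '{"placement": 2, "entrant": {"id": 8559863, "name": "Test"}}'
)
df_data1 = parse_concatenated(sample)
print(df_data1)
```

The resulting list can be passed to the DataFrame construction shown above in the same way.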
