Unpacking a pandas DataFrame column that contains a list of dictionaries into new columns: answers

【Question title】: Unpacking a pandas dataframe column that contains a list of dictionaries into new columns
【Posted】: 2021-07-10 14:32:40
【Question description】:

I have a DataFrame new_df with a single column. Some rows hold a list of dictionaries and the other rows are NaN.

new_df
                                                            0
    0                                                 NaN
    1                                                 NaN
    2   [{'start_time': '09:16:44', 'e...
    3   [{'start_time': '09:36:44', 'e...
    4   [{'start_time': '09:46:44', 'e...
    5   [{'start_time': '09:48:44', 'e...
    6   [{'start_time': '09:55:44', 'e...
    7   [{'start_time': '09:59:44', 'e...
    8   [{'start_time': '10:50:22', 'e...
    9   [{'start_time': '11:30:22', 'e...
    10  [{'start_time': '11:35:22', 'e...
    11  [{'start_time': '12:50:22', 'e...
    12                                                NaN
    13                                                NaN

When a row holds a list containing a dictionary, it has the following format:

[{'start_time': '09:16:44', 'end_time': '9:36:44', 'job_id': '123456'}]

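I need to unpack the dictionary in each list/row of new_df into new columns and apply those new columns to another DataFrame.

The problem I'm having is preserving the index of new_df, which is needed to apply the new column data to the other DataFrame correctly.

I can unpack the lists and create new columns from the dictionary values, but when I apply the new columns they start at row[0] instead of row[2] in this case. I lose the leading and trailing rows whose value is NaN.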

add_df = pd.DataFrame(list(new_df[0]))

produces

  start_time   end_time   job_id  
0  09:16:44  09:36:44     123456
1  09:36:44  09:46:44     123457
2  09:46:44  09:48:44     123458
3  09:48:44  09:59:59     123459
      ...      ...          ...
8  11:35:22  12:45:00     123460
9  12:50:22  13:00:00     123461

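I need to preserve the index as shown below, with the data placed at the index labels of the new_df rows that contain lists of dictionaries: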

      start_time   end_time   job_id  
    0    NaN        NaN         NaN
    1    NaN        NaN         NaN
    2  09:16:44  09:36:44     123456
    3  09:36:44  09:46:44     123457
    4  09:46:44  09:48:44     123458
    5  09:48:44  09:59:59     123459
          ...      ...          ...
   10  11:35:22  12:45:00     123460
   11  12:50:22  13:00:00     123461
   12    NaN        NaN         NaN
   13    NaN        NaN         NaN

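How can I preserve the index so that the leading and trailing NaN rows are kept?

【Question comments】:

Can you add df.head().to_dict() to your question so we can see the exact format of your data? Otherwise, if the way you create add_df works for you, you can specify the index, like pd.DataFrame(list(df[0]), index=df.index[df[0].notna()])
I have updated the post. Your comment made me realize I was not clear about the end result I want. I tried your suggestion, but unfortunately it throws ValueError: Shape of passed values is (14, 1), indices imply (10, 1)
I still can't reproduce exactly the same behavior as you, but try using dropna and then reindexing with the original index: pd.DataFrame(list(df[0].dropna()), index=df.index[df[0].notna()]).reindex(df.index)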
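
The dropna-then-reindex suggestion in the last comment can be sketched roughly as follows. This is only a minimal sketch: it assumes every non-NaN cell is a one-element list holding a single dict, as in the format shown in the question, so the dict is unwrapped before the frame is built. The sample data and the helper name `unpack_dict_column` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the shape described in the question.
df = pd.DataFrame({0: [
    np.nan,
    [{'start_time': '09:16:44', 'end_time': '09:36:44', 'job_id': '123456'}],
    [{'start_time': '09:36:44', 'end_time': '09:46:44', 'job_id': '123457'}],
    np.nan,
]})

def unpack_dict_column(col):
    """Expand a Series of one-element dict lists into columns, keeping the index."""
    present = col.dropna()
    # Assumption: exactly one dict per list, so take element 0 of each cell.
    records = [cell[0] for cell in present]
    unpacked = pd.DataFrame(records, index=present.index)
    # Reindexing to the original index brings the NaN rows back as all-NaN rows.
    return unpacked.reindex(col.index)

add_df = unpack_dict_column(df[0])
print(add_df)
```

Passing `index=present.index` when the frame is built is what keeps the original row labels, and `reindex` then restores the rows that `dropna` removed.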

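Tags: python pandas list dataframe dictionary

【Solution 1】:

@Ben.T's comment got me thinking about what I was actually trying to accomplish.

I am creating a DataFrame from a Series of lists of dictionaries. Why would I strip this new DataFrame apart column by column when I can attach it to the existing DataFrame along the column axis?

My solution: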

import pandas as pd

# Build a DataFrame from the non-NaN elements only
# (dropna is a Series method, so it must be called before converting to a list)
new_df = pd.DataFrame(list(orig_df[0].dropna()))

# Get the orig_df index labels of the non-NaN rows to apply to new_df
new_ndx = orig_df.index[orig_df[0].notna()]

# Reset the index, then assign the labels so the rows line up with orig_df
new_df = new_df.reset_index(drop=True)
new_df = new_df.set_index(new_ndx)

# Now attach new_df to orig_df along the column axis
orig_df = pd.concat([orig_df, new_df], axis=1)
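
As a rough illustration, the same steps can be run end to end on made-up data. This sketch is not the author's exact code: it assumes each non-NaN cell is a one-element list holding a single dict, so it unwraps that dict before building the frame, and it sets the index directly instead of resetting it afterwards. The `orig_df` below is hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: NaN rows around rows holding one-element dict lists.
orig_df = pd.DataFrame({0: [
    np.nan,
    [{'start_time': '09:16:44', 'end_time': '09:36:44', 'job_id': '123456'}],
    [{'start_time': '09:36:44', 'end_time': '09:46:44', 'job_id': '123457'}],
    np.nan,
]})

# Keep only the rows that hold data; their index labels are preserved.
present = orig_df[0].dropna()

# Assumption: one dict per list, so unwrap element 0 of each cell.
new_df = pd.DataFrame([cell[0] for cell in present], index=present.index)

# concat on axis=1 aligns rows by index label, so rows of orig_df
# that have no match in new_df get NaN in the new columns.
result = pd.concat([orig_df, new_df], axis=1)
print(result)
```

Because `pd.concat(..., axis=1)` aligns on the index, the new columns land on the rows they came from without any positional bookkeeping.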

Is there a more Pythonic way to do this...?

【Comments】: