Using Pandas to create a DataFrame out of a complicated dictionary / list

【Question title】: Using Pandas to create DataFrame out of complicated dictionary / list
【Posted】: 2020-01-22 18:24:30
【Question】:

I have a list of dictionaries like this:

dictionary = [{
    'vendor': 'vendor1',
    'option_list': [{
        'col1_name': 'Column1',
        'col1_options': ['option1', 'option2', 'option3']
        }, {
        'col2_name': 'Column2',
        'col2_options': ['small']
        },  {
        'col3_name': 'Column3',
        'col3_options': ['yellow', 'black', 'green']
        }
    ]
},  {
    'vendor': 'vendor2',
    'option_list': [{
        'col1_name': 'Column1',
        'col1_options': ['option3']
        }, {
        'col2_name': 'Column2',
        'col2_options': ['small', 'medium', 'large']
        }, {
        'col3_name': 'Column3',
        'col3_options': ['yellow', 'green']
        }
    ]
}]

I want to turn it into a pandas DataFrame like this:

Vendor    Column1    Column2    Column3
vendor1   option1    small      yellow
vendor1   option2    NaN        black
vendor1   option3    NaN        green
vendor2   option3    small      yellow
vendor2   NaN        medium     green
vendor2   NaN        large      NaN

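The problem is that I don't know in advance how many vendors and columns I will get. Also, as the example above shows, some of the resulting cells can be NaN.

Is there a way to create a DataFrame from this kind of dictionary using pandas?

Any help is appreciated!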

【Question comments】:

Tags: python python-3.x pandas dataframe dictionary

【Solution 1】:

Process it in pure Python first, then use a bit of pandas for the final adjustments:

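import numpy as np
import pandas as pd

# one [vendor, options] pair per entry in each vendor's option_list
a = [[x['vendor'], vals[f'col{i+1}_options']] for x in dictionary
                                              for (i, vals) in enumerate(x['option_list'])]

vendors, data = zip(*a)

pd.DataFrame(data)\
  .groupby(list(vendors))\
  .apply(np.transpose)\
  .reset_index(drop=True, level=1)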

               3       4       5
vendor1  option1   small  yellow
vendor1  option2    None   black
vendor1  option3    None   green
vendor2  option3   small  yellow
vendor2     None  medium   green
vendor2     None   large    None

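【Comments】:

I'll try it tomorrow. One thing changes the situation here: can we assume that col_name and col_options are always the same keys, with no need for the +1 increment?
@PiotrKonopnicki Yes, just make col_options a static string instead of a dynamic f-string.
Somehow this code crashes my kernel on jupyter-notebook-5.7.8 with Python 3.7.3 on Windows. Any ideas?
@AshutoshParida What do you mean by crash?
The Python process terminates. The error in Jupyter Notebook is: the kernel appears to have died. It will restart automatically.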

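【Solution 2】:

I'm not aware of a pandas function that converts this type of dictionary directly into the desired DataFrame. You have to build intermediate dictionaries that can be passed to the DataFrame constructor, and then concatenate the resulting DataFrames.

The code below should do the trick: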

dictionary = [{
    'vendor': 'vendor1',
    'option_list': [{
        'col1_name': 'Column1',
        'col1_options': ['option1', 'option2', 'option3']
        }, {
        'col2_name': 'Column2',
        'col2_options': ['small']
        },  {
        'col3_name': 'Column3',
        'col3_options': ['yellow', 'black', 'green']
        }
    ]
},  {
    'vendor': 'vendor2',
    'option_list': [{
        'col1_name': 'Column1',
        'col1_options': ['option3']
        }, {
        'col2_name': 'Column2',
        'col2_options': ['small', 'medium', 'large']
        }, {
        'col3_name': 'Column3',
        'col3_options': ['yellow', 'green']
        }
    ]
}]

import pandas as pd

to_concat = []
for one_vendor_dict in dictionary:
    new_option_dict = {}
    for option_dict in one_vendor_dict['option_list']:
        column_name, option_value = None, None
        # get column name and column values
        for k, v in option_dict.items():
            if 'name' in k:
                column_name = v
            if 'options' in k:
                option_value = v
        if column_name and option_value:
            new_option_dict[column_name] = option_value

    # pad all lists to the same length so that a DataFrame can be built
    max_length = max([len(v) for v in new_option_dict.values()])
    for k, v in new_option_dict.items():
        if len(v) < max_length:
            new_option_dict.update({k: v + [None] * (max_length - len(v))})
    # add the vendor column
    new_option_dict.update({'Vendor': [one_vendor_dict['vendor']] * max_length})
    # create a dataframe for this vendor
    to_concat.append(pd.DataFrame(new_option_dict))
df = pd.concat(to_concat).reset_index(drop=True)

This prints:

   Column1 Column2 Column3   Vendor
0  option1   small  yellow  vendor1
1  option2    None   black  vendor1
2  option3    None   green  vendor1
3  option3   small  yellow  vendor2
4     None  medium   green  vendor2
5     None   large    None  vendor2

If one vendor has more columns than another, the concat function fills the missing cells with None or NaN when concatenating.
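The fill behaviour can be checked on a tiny example. The sketch below uses two hypothetical per-vendor frames (`vendor_a` and `vendor_b`, names not taken from the answer) where only the second one has a `Column3`:

```python
import pandas as pd

# Hypothetical per-vendor frames; only vendor_b has a Column3.
vendor_a = pd.DataFrame({'Column1': ['option1'], 'Vendor': ['vendor1']})
vendor_b = pd.DataFrame({'Column1': ['option3'], 'Column3': ['yellow'], 'Vendor': ['vendor2']})

combined = pd.concat([vendor_a, vendor_b]).reset_index(drop=True)

# The result carries the union of the columns; the vendor1 row has a
# missing value in Column3 because vendor_a never defined that column.
print(combined)
```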

I used None because the options are strings, but the isna function detects it correctly if you need to test for missing values.
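As a quick check of that claim, here is a minimal sketch with a hypothetical string column padded with None:

```python
import pandas as pd

# A string column padded with None, in the same way as the answer pads its lists.
column = pd.Series(['small', None, None])

# isna() reports True for None as well as for NaN.
mask = column.isna()
print(mask.tolist())  # [False, True, True]
```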

【Comments】:

【Solution 3】:

I tried a different approach, using the pandas merge function:

import pandas as pd

final_df = pd.DataFrame()  # accumulates the final result

# loop through the vendors and build one DataFrame per column
for i in range(len(dictionary)):
    df0 = pd.DataFrame([dictionary[i]['vendor']], columns=['vendor'])
    df1 = pd.DataFrame(dictionary[i]['option_list'][0]['col1_options'], columns=['Column1'])
    df2 = pd.DataFrame(dictionary[i]['option_list'][1]['col2_options'], columns=['Column2'])
    df3 = pd.DataFrame(dictionary[i]['option_list'][2]['col3_options'], columns=['Column3'])

    # outer-merge on the index so that shorter columns are padded with NaN
    df_merg1 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    df_merg2 = pd.merge(df_merg1, df3, how='outer', left_index=True, right_index=True)

    # repeat the vendor row so that it matches the length of the longest column
    df0 = pd.concat([df0] * df_merg2.shape[0], ignore_index=True)

    # the complete DataFrame for this vendor
    df_merg3 = pd.merge(df0, df_merg2, how='left', left_index=True, right_index=True)

    # keep concatenating into the final output
    final_df = pd.concat([final_df, df_merg3], axis=0, ignore_index=True)

# display the final output
final_df
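The code above hard-codes three columns, while the question says the number of columns is not known in advance. A possible generalization is sketched below; it is not part of the original answer. It assumes every option entry has exactly one key ending in `_name` and one ending in `_options`, and the helper names `vendor_frame` and `build_frame` as well as the shortened `sample` data are made up for illustration.

```python
import pandas as pd

def vendor_frame(entry):
    """Build the DataFrame for a single vendor entry (hypothetical helper)."""
    columns = []
    for option in entry['option_list']:
        # Assumption: exactly one key ends in '_name' and one in '_options'.
        name = next(v for k, v in option.items() if k.endswith('_name'))
        values = next(v for k, v in option.items() if k.endswith('_options'))
        columns.append(pd.Series(values, name=name, dtype='object'))
    # Concatenating along axis=1 aligns on the index, so shorter
    # columns are padded with NaN.
    frame = pd.concat(columns, axis=1)
    frame.insert(0, 'vendor', entry['vendor'])
    return frame

def build_frame(entries):
    """Stack the per-vendor frames into one DataFrame (hypothetical helper)."""
    return pd.concat([vendor_frame(e) for e in entries], ignore_index=True)

# Shortened sample in the same shape as the question's data.
sample = [
    {'vendor': 'vendor1', 'option_list': [
        {'col1_name': 'Column1', 'col1_options': ['option1', 'option2']},
        {'col2_name': 'Column2', 'col2_options': ['small']},
    ]},
    {'vendor': 'vendor2', 'option_list': [
        {'col1_name': 'Column1', 'col1_options': ['option3']},
        {'col2_name': 'Column2', 'col2_options': ['small', 'medium', 'large']},
    ]},
]

result = build_frame(sample)
print(result)
```

Because the column names are read from the data, the same function should work for any number of vendors and columns, as long as the key-naming assumption holds.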

【Comments】: