Pandas PerformanceWarning: DataFrame is highly fragmented. What's the efficient solution?

[Question title]: Pandas PerformanceWarning: DataFrame is highly fragmented. What's the efficient solution?
[Posted]: 2022-01-01 21:10:44
[Question]:

This is generic code that represents what happens in my script:

import pandas as pd
import numpy as np

dic = {}

for i in np.arange(0,10):
    dic[str(i)] = df = pd.DataFrame(np.random.randint(0,1000,size=(5000, 20)), 
                                    columns=list('ABCDEFGHIJKLMNOPQRST'))
    
df_out = pd.DataFrame(index = df.index)

for i in np.arange(0,10):
    df_out['A_'+str(i)] = dic[str(i)]['A'].astype('int')
    df_out['D_'+str(i)] = dic[str(i)]['D'].astype('int')
    df_out['H_'+str(i)] = dic[str(i)]['H'].astype('int')
    df_out['I_'+str(i)] = dic[str(i)]['I'].astype('int')
    df_out['M_'+str(i)] = dic[str(i)]['M'].astype('int')
    df_out['O_'+str(i)] = dic[str(i)]['O'].astype('int')
    df_out['Q_'+str(i)] = dic[str(i)]['Q'].astype('int')
    df_out['R_'+str(i)] = dic[str(i)]['R'].astype('int')
    df_out['S_'+str(i)] = dic[str(i)]['S'].astype('int')
    df_out['T_'+str(i)] = dic[str(i)]['T'].astype('int')
    df_out['C_'+str(i)] = dic[str(i)]['C'].astype('int')

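You will notice that as soon as the number of columns inserted into df_out (the output) exceeds 100, I get the following warning:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead

The question is, how can I use:

pd.concat()

and still have custom column names that depend on the dictionary keys?

IMPORTANT: I still want to keep a specific selection of columns, not all of them. As in the example: A, D, H, I, etc.

SPECIAL EDIT (based on Corralien's answer)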

cols = {'A': 'float',
        'D': 'bool'}

out = pd.DataFrame()
for c, df in dic.items():
    for col, ftype in cols.items():
        out = pd.concat([out,df[[col]].add_suffix(f'_{c}')], 
                        axis=1).astype(ftype)

Many thanks for your help!

[Question comments]:

Tags: pandas insert concatenation

[Solution 1]:

You can use a comprehension with pd.concat:

cols = {'A': 'float', 'D': 'bool'}

# Select with list(cols): recent pandas versions reject a dict as an indexer.
out = pd.concat([df[list(cols)].astype(cols).add_prefix(f'{k}_')
                    for k, df in dic.items()], axis=1)
print(out)

# Output:
     0_A   0_D    1_A   1_D    2_A   2_D    3_A   3_D
0  116.0  True  396.0  True  944.0  True  398.0  True
1  128.0  True  102.0  True  561.0  True   70.0  True
2  982.0  True  613.0  True  822.0  True  246.0  True
3  830.0  True  366.0  True  861.0  True  906.0  True
4  533.0  True  741.0  True  305.0  True  874.0  True
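
The question asked for names of the form A_0 rather than 0_A. A minimal sketch of the same comprehension using add_suffix instead, applied to the column selection from the question. The seeded generator and the names rng and wanted are illustrative assumptions, not part of the answer:

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's: 10 frames of 5000 x 20 integers.
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
dic = {
    str(i): pd.DataFrame(rng.integers(0, 1000, size=(5000, 20)),
                         columns=list('ABCDEFGHIJKLMNOPQRST'))
    for i in range(10)
}

# The specific columns to keep from every frame (same order as the question).
wanted = list('ADHIMOQRSTC')

# Build every renamed piece first, then join them in a single concat call.
df_out = pd.concat(
    [df[wanted].astype('int').add_suffix(f'_{k}') for k, df in dic.items()],
    axis=1,
)

print(df_out.shape)              # (5000, 110)
print(list(df_out.columns[:3]))  # ['A_0', 'D_0', 'H_0']
```

Because the frame is assembled in one call, no column is inserted one at a time, which is the pattern the warning complains about.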

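[Comments]:

Also a perfectly valid answer. Too bad I can't accept both options...
No problem. The important thing is that it works for you, even if you end up using my solution :-P. LOL
Where should I add .astype('int') in your answer? Before or after add_prefix?
Use it at the end of pd.concat.
I updated my answer. In this case, use your dict to convert the dtypes before add_prefix.

[Solution 2]:

Use concat, then flatten the resulting MultiIndex with map: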
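
The two placements discussed in these comments can be compared side by side. A rough sketch, assuming two tiny made-up frames (the name dic mirrors the question; the values and the names same and mixed are invented for illustration):

```python
import pandas as pd

# Two tiny stand-in frames; the values are invented for illustration.
dic = {
    '0': pd.DataFrame({'A': [1.0, 2.0], 'D': [0.0, 5.0]}),
    '1': pd.DataFrame({'A': [3.0, 4.0], 'D': [7.0, 0.0]}),
}

# Case 1: one dtype for every column -> cast once, after the concat.
same = pd.concat(
    [df[['A', 'D']].add_prefix(f'{k}_') for k, df in dic.items()], axis=1
).astype('int')

# Case 2: per-column dtypes -> cast with the dict *before* renaming,
# while the column names still match the dict keys.
cols = {'A': 'float', 'D': 'bool'}
mixed = pd.concat(
    [df[list(cols)].astype(cols).add_prefix(f'{k}_') for k, df in dic.items()],
    axis=1,
)

print(same.dtypes)
print(mixed.dtypes)
```

Casting after the rename in the second case would need a dict keyed by the prefixed names (0_A, 0_D, ...), which is why the cast comes first there.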

cols = ['A','D']
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')

print(df_out)
   A_0  D_0  A_1  D_1  A_2  D_2  A_3  D_3
0  116  341  396  502  944  483  398  839
1  128  621  102   70  561  656   70  169
2  982   44  613  775  822  379  246   25
3  830  987  366  481  861  632  906  676
4  533  349  741  410  305  422  874   19
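
To make the intermediate step visible, here is a small sketch of what concat returns when it is given a dict, before the columns are flattened. The two short frames are stand-ins (a few values borrowed from the output above, column B invented), and the names stacked and flat are illustrative:

```python
import pandas as pd

# Small stand-in frames; column B exists only to show that it gets dropped.
dic = {
    '0': pd.DataFrame({'A': [116, 128], 'B': [5, 6], 'D': [341, 621]}),
    '1': pd.DataFrame({'A': [396, 102], 'B': [7, 8], 'D': [502, 70]}),
}

cols = ['A', 'D']

# Passing a dict to concat uses its keys as the outer column level.
stacked = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1)
print(stacked.columns.tolist())
# [('0', 'A'), ('0', 'D'), ('1', 'A'), ('1', 'D')]

# Flatten each (key, column) tuple into a single 'column_key' string.
flat = stacked.copy()
flat.columns = [f'{col}_{key}' for key, col in stacked.columns]
print(flat.columns.tolist())
# ['A_0', 'D_0', 'A_1', 'D_1']
```

The list comprehension over the tuples does the same job as the columns.map call with a lambda in the answer; which one to use is a matter of taste.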

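[Comments]:

If I'm right, your answer assumes that all columns of all dfs in the dictionary are taken. I only want to take some specific columns, not all of them. At least the specific columns to select are the same across all dfs in the dictionary.
Say I only want A and D, so the result would have the following columns: A_0 D_0, A_1 D_1, A_2 D_2, A_3 D_3,
@plonfat - Added to the answer.
Any idea why a simple df['new_col'] = col is not efficient? For me it's easier/lighter to write...
@plonfat - It is inefficient with many iterations; if there are only a few, it is perfectly fine.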
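
The difference between the two styles can be observed directly by recording warnings. This is only a sketch on synthetic data: the sizes (150 columns, 1000 rows) and every name in it are assumptions chosen for illustration, and whether the loop warns at all depends on the installed pandas version:

```python
import warnings

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = {f'col_{i}': rng.integers(0, 1000, size=1000) for i in range(150)}

def count_perf(caught):
    """Count the recorded warnings that are pandas PerformanceWarnings."""
    return sum(issubclass(w.category, pd.errors.PerformanceWarning)
               for w in caught)

# Style 1: add the columns one at a time, as in the question.
with warnings.catch_warnings(record=True) as caught_loop:
    warnings.simplefilter('always')
    looped = pd.DataFrame(index=range(1000))
    for name, values in data.items():
        looped[name] = values

# Style 2: collect the pieces, then build the frame in one concat call.
with warnings.catch_warnings(record=True) as caught_concat:
    warnings.simplefilter('always')
    joined = pd.concat(
        [pd.Series(values, name=name) for name, values in data.items()],
        axis=1,
    )

print('PerformanceWarnings from the loop:  ', count_perf(caught_loop))
print('PerformanceWarnings from the concat:', count_perf(caught_concat))
```

Both frames hold the same values; only the way they were assembled differs. As the comment above says, the column-by-column style only becomes a concern once it is repeated many times.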