Python pandas: group by two columns and create new variables

【Question title】: Python pandas: groupby on two columns and create new variables
【Posted】: 2019-01-29 18:17:38
【Question】:

I have the following dataframe, which describes the percentage of shares that certain types of investors hold in a company:

    company  investor   pct 
       1       A         1
       1       A         2
       1       B         4
       2       A         2
       2       A         4
       2       A         6 
       2       C         10
       2       C         8
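
To reproduce the problem, the table above can be rebuilt as a DataFrame. This is only a sketch: the dtypes (integers for `company` and `pct`, strings for `investor`) are an assumption, since the question shows only the printed values.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: company and pct are integers, investor is a string label.
df = pd.DataFrame({
    "company":  [1, 1, 1, 2, 2, 2, 2, 2],
    "investor": ["A", "A", "B", "A", "A", "A", "C", "C"],
    "pct":      [1, 2, 4, 2, 4, 6, 10, 8],
})
print(df)
```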

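For each investor type, I want to create a new column that holds the mean share held by that type in each company. I also need to keep the dataset at the same length, for example by using transform.

This is the result I want: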

     company  investor   pct   pct_mean_A   pct_mean_B   pct_mean_C
       1       A         1        1.5          4            0
       1       A         2        1.5          4            0
       1       B         4        1.5          4            0
       2       A         2        4.0          0            9
       2       A         4        4.0          0            9
       2       A         6        4.0          0            9
       2       C         10       4.0          0            9
       2       C         8        4.0          0            9

Thank you very much for your help!

【Comments】:

Tags: python pandas transform

【Solution 1】:

Use groupby with the mean aggregation, reshape the helper DataFrame with unstack, and then join it to the original df:

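# Mean pct per (company, investor), reshaped so each investor becomes a column
s = (df.groupby(['company','investor'])['pct']
       .mean()
       .unstack(fill_value=0)
       .add_prefix('pct_mean_'))

# Join the per-company means back onto every original row
df = df.join(s, 'company')
print(df)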
   company investor  pct  pct_mean_A  pct_mean_B  pct_mean_C
0        1        A    1         1.5         4.0         0.0
1        1        A    2         1.5         4.0         0.0
2        1        B    4         1.5         4.0         0.0
3        2        A    2         4.0         0.0         9.0
4        2        A    4         4.0         0.0         9.0
5        2        A    6         4.0         0.0         9.0
6        2        C   10         4.0         0.0         9.0
7        2        C    8         4.0         0.0         9.0
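
To see why the join keeps the original length, it may help to split the chain above into steps and look at the helper table. The sketch below is self-contained; the intermediate names `means`, `wide` and `result` are mine, not the answer's, and the sample data is rebuilt from the question with assumed dtypes.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "company":  [1, 1, 1, 2, 2, 2, 2, 2],
    "investor": ["A", "A", "B", "A", "A", "A", "C", "C"],
    "pct":      [1, 2, 4, 2, 4, 6, 10, 8],
})

# Step 1: mean pct per (company, investor) pair.
# The result is a Series indexed by a (company, investor) MultiIndex.
means = df.groupby(["company", "investor"])["pct"].mean()

# Step 2: move the investor level into the columns.
# Pairs that never occur (e.g. company 1, investor C) are filled with 0.
wide = means.unstack(fill_value=0).add_prefix("pct_mean_")
print(wide)

# Step 3: `wide` has one row per company, indexed by company.
# Joining on the 'company' column repeats that row for every matching
# row of df, so the result has the same number of rows as df.
result = df.join(wide, on="company")
print(result)
```

Because `wide` has exactly one row per company, the join can never add or drop rows of `df`; it only adds columns.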

Or use pivot_table, whose default aggregation function is mean:

# aggfunc defaults to 'mean', so no explicit aggregation is needed
s = df.pivot_table(index='company',
                   columns='investor',
                   values='pct',
                   fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print(df)
   company investor  pct  pct_mean_A  pct_mean_B  pct_mean_C
0        1        A    1         1.5           4           0
1        1        A    2         1.5           4           0
2        1        B    4         1.5           4           0
3        2        A    2         4.0           0           9
4        2        A    4         4.0           0           9
5        2        A    6         4.0           0           9
6        2        C   10         4.0           0           9
7        2        C    8         4.0           0           9
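
As a quick sanity check, the two helper tables can be compared directly. This is a sketch rather than part of the answer: `aggfunc="mean"` is written out explicitly for readability, the names `by_groupby` and `by_pivot` are mine, and the values are compared as floats because the two approaches may return different dtypes depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "company":  [1, 1, 1, 2, 2, 2, 2, 2],
    "investor": ["A", "A", "B", "A", "A", "A", "C", "C"],
    "pct":      [1, 2, 4, 2, 4, 6, 10, 8],
})

# Helper table built with groupby + unstack.
by_groupby = (df.groupby(["company", "investor"])["pct"]
                .mean()
                .unstack(fill_value=0)
                .add_prefix("pct_mean_"))

# Helper table built with pivot_table (mean spelled out explicitly).
by_pivot = (df.pivot_table(index="company",
                           columns="investor",
                           values="pct",
                           aggfunc="mean",
                           fill_value=0)
              .add_prefix("pct_mean_"))

# Compare labels exactly and values numerically, ignoring int/float dtype.
same_columns = list(by_groupby.columns) == list(by_pivot.columns)
same_values = np.allclose(by_groupby.to_numpy(dtype=float),
                          by_pivot.to_numpy(dtype=float))
print(same_columns, same_values)
```

Both snippets in the answer start from the original three-column df; running the second one on a frame that already carries the `pct_mean_*` columns would make the join fail on overlapping column names.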

【Comments】:

The OP's dataframe has the wrong pct_mean_A value, i.e. 12.
@SandeepKadapa - it happens ;)