Pandas groupby multiple aggregations with exclusion of the focal group

【Question title】: Pandas groupby multiple aggregations with exclusion of the focal group
【Posted】: 2020-12-22 13:23:00
【Question description】:

I have a toy dataset like the one shown below.

   Building  Department  feature1  feature2
0         A           1        14        28
1         A           1        11        26
2         A           1        29        19
3         A           2        26        28
4         A           2        22        27
5         A           2        20        24
6         A           2        15        14
7         A           2        30        21
8         A           3        30        15
9         A           3        16        29
10        A           3        25        23
11        A           3        26        15
12        A           3        11        11

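I want to compute these variables:

For each building and department, Score1 is the average of feature1 and feature2 for that department (flatten the values and take the mean, nothing fancy).
For each building and department, Score2 is the average of feature1 and feature2 over all other departments in the same building, i.e. excluding that department (the focal group).

So for Department 1, Score1 is computed from the values of Department 1, while Score2 is computed from the values of Departments 2 and 3.

Final result: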

  Building  Department  Score1  Score2
0        A           1   21.16  21.400
1        A           2   22.70  20.500
2        A           3   20.10  22.125

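I could not find any pandas shortcut for this kind of "exclusion". One possible solution is to loop over the groups and compute it that way, but my data is too large for such a loop.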
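
For reference, the loop-based baseline mentioned above could be sketched as follows. This is only an illustration built from the toy data in the question (the variable names `rows` and `expected` are assumptions, not part of the original post); it is slow on large data but useful for checking a vectorized solution.

```python
import pandas as pd

# Toy data copied from the table in the question
df = pd.DataFrame({
    "Building": ["A"] * 13,
    "Department": [1] * 3 + [2] * 5 + [3] * 5,
    "feature1": [14, 11, 29, 26, 22, 20, 15, 30, 30, 16, 25, 26, 11],
    "feature2": [28, 26, 19, 28, 27, 24, 14, 21, 15, 29, 23, 15, 11],
})

features = ["feature1", "feature2"]
rows = []
for building, bdf in df.groupby("Building"):
    for dept in bdf["Department"].unique():
        focal = bdf["Department"] == dept
        rows.append({
            "Building": building,
            "Department": dept,
            # flatten both feature columns of the focal department and average
            "Score1": bdf.loc[focal, features].to_numpy().mean(),
            # same, but over every other department in this building
            "Score2": bdf.loc[~focal, features].to_numpy().mean(),
        })

expected = pd.DataFrame(rows)
print(expected)
```

The nested loop touches the building's rows once per department, which is why it scales poorly compared with the sum/count approaches in the answers.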

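Any help or hints are appreciated. Thanks!

【Question comments】:

For Score2, do other buildings matter?
Score2 is also computed per building. So other buildings are not considered, only the focal building.

Tags: python pandas dataframe pandas-groupby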

【Solution 1】:

You can do it like this:

import pandas as pd

# stack to reshape the dataframe into one long Series of feature values
s = df.set_index(['Building', 'Department']).stack()

# sum and count per (Building, Department)
m1 = s.groupby(level=[0, 1]).agg(['sum', 'count'])
# per-building totals minus the focal group (aligned on the Building level)
m2 = s.groupby(level=0).agg(['sum', 'count']) - m1

# compute mean = sum / count and concatenate along axis=1
out = pd.concat([m1['sum'] / m1['count'], m2['sum'] / m2['count']],
                axis=1, keys=['Score1', 'Score2']).reset_index()

Details:

First, set Building and Department as the index of the dataframe and stack it, which reshapes the data and flattens the features into a single Series:

# s
Building  Department          
A         1           feature1    14
                      feature2    28
                      feature1    11
                      feature2    26
                      feature1    29
                      feature2    19
          2           feature1    26
                      feature2    28
                      feature1    22
                      feature2    27
                      feature1    20
                      feature2    24
                      feature1    15
                      feature2    14
                      feature1    30
                      feature2    21
          3           feature1    30
                      feature2    15
                      feature1    16
                      feature2    29
                      feature1    25
                      feature2    23
                      feature1    26
                      feature2    15
                      feature1    11
                      feature2    11

Group the stacked data and aggregate with sum and count:

# m1 : sum and count per building and department
                     sum  count
Building Department            
A        1           127      6
         2           227     10
         3           201     10

# m2 : sum and count per building - m1
                     sum  count
Building Department            
A        1           428     20
         2           328     16
         3           354     16
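
The m2 values above can be checked by hand: the building totals are 127 + 227 + 201 = 555 and 6 + 10 + 10 = 26, and each row of m2 is that total minus the matching row of m1. A small sketch of the same arithmetic, rebuilding m1 from the printed values (the names `total` and `score2` are illustrative only):

```python
import pandas as pd

# m1 as printed above, for building A
m1 = pd.DataFrame(
    {"sum": [127, 227, 201], "count": [6, 10, 10]},
    index=pd.Index([1, 2, 3], name="Department"),
)

total = m1.sum()   # building-wide sum and count
m2 = total - m1    # leave-one-out sum and count per department
score2 = m2["sum"] / m2["count"]
print(m2)
print(score2)
```

Subtracting sums and counts, rather than averaging the other departments' means, keeps the result weighted by the number of values in each department.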

Compute the means for m1 (Score1) and m2 (Score2) by dividing the sum column by count, then concatenate these means with pd.concat to get the desired result:

# out
  Building  Department     Score1  Score2
0        A           1  21.166667  21.400
1        A           2  22.700000  20.500
2        A           3  20.100000  22.125

【Comments】:

【Solution 2】:

Here is my suggestion:

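import pandas as pd

# row-wise average of the two features
df['feature12'] = (df['feature1'] + df['feature2']) / 2

# totals over the whole dataframe
dfsum = df['feature12'].sum()
dflen = len(df['feature12'])

# mean and count per (Building, Department)
pv = pd.pivot_table(df, index=['Building', 'Department'], values='feature12', aggfunc=['mean', 'count'])

# Score2: remove the focal group's sum and count from the totals
pv['Score2'] = (dfsum - pv['mean']['feature12'] * pv['count']['feature12']) / (dflen - pv['count']['feature12'])
pv['Score1'] = pv['mean']['feature12']

res = pv[['Score1', 'Score2']]

# flatten the column MultiIndex and restore Building/Department as columns
res.columns = res.columns.get_level_values(0)
res = res.reset_index(level=[0, 1])
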

Output:

>>> print(res)
  Building  Department     Score1  Score2
0        A           1  21.166667  21.400
1        A           2  22.700000  20.500
2        A           3  20.100000  22.125
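
The snippet above subtracts each group from totals taken over the whole dataframe, which matches the expected output when there is a single building, as in the toy data. With several buildings, the totals would presumably need to be taken per building, as the question comments request. A minimal sketch of that variant, using made-up two-building data and `groupby(...).transform('sum')` (the data and the names `g`, `totals` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: two buildings, one row per department
df = pd.DataFrame({
    "Building": ["A", "A", "A", "B", "B", "B"],
    "Department": [1, 2, 3, 1, 2, 3],
    "feature1": [10, 20, 30, 100, 200, 300],
    "feature2": [10, 20, 30, 100, 200, 300],
})
df["feature12"] = (df["feature1"] + df["feature2"]) / 2

# sum and count per (Building, Department)
g = df.groupby(["Building", "Department"])["feature12"].agg(["sum", "count"])

# per-building totals, repeated on every department row of that building
totals = g.groupby(level="Building").transform("sum")

res = pd.DataFrame({
    "Score1": g["sum"] / g["count"],
    "Score2": (totals["sum"] - g["sum"]) / (totals["count"] - g["count"]),
}).reset_index()
print(res)
```

Here department 1 of building A gets Score2 = (20 + 30) / 2 = 25, whereas dataframe-wide totals would mix in building B's values.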

【Comments】:

【Solution 3】:

A two-step approach (I believe it is cleaner):

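# Step 1: per (Building, Department), sum and count of the row-wise feature average;
# Score1 is the resulting group mean.
agg_df = (df
          .assign(avg=lambda x: (x['feature1'] + x['feature2']) / 2)
          .groupby(['Building', 'Department'])['avg']
          .agg(['sum', 'count'])
          .reset_index())
agg_df['Score1'] = agg_df['sum'] / agg_df['count']

# Step 2: Score2 is the pooled mean of the other departments in the same building.
def score2(row):
    others = agg_df[(agg_df.Building == row.Building)
                    & (agg_df.Department != row.Department)]
    return others['sum'].sum() / others['count'].sum()

agg_df['Score2'] = agg_df.apply(score2, axis=1)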

where df is:

df = pd.DataFrame({'Building': {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'A',
  8: 'A',
  9: 'A',
  10: 'A',
  11: 'A',
  12: 'A',
  13: 'B',
  14: 'B',
  15: 'B',
  16: 'B',
  17: 'B',
  18: 'B',
  19: 'B',
  20: 'B',
  21: 'B',
  22: 'B',
  23: 'B',
  24: 'B',
  25: 'B',
  26: 'C',
  27: 'C',
  28: 'C',
  29: 'C',
  30: 'C',
  31: 'C',
  32: 'C',
  33: 'C',
  34: 'C',
  35: 'C',
  36: 'C',
  37: 'C',
  38: 'C'},
 'Department': {0: 1,
  1: 1,
  2: 1,
  3: 2,
  4: 2,
  5: 2,
  6: 2,
  7: 2,
  8: 3,
  9: 3,
  10: 3,
  11: 3,
  12: 3,
  13: 1,
  14: 1,
  15: 1,
  16: 2,
  17: 2,
  18: 2,
  19: 2,
  20: 2,
  21: 3,
  22: 3,
  23: 3,
  24: 3,
  25: 3,
  26: 1,
  27: 1,
  28: 1,
  29: 2,
  30: 2,
  31: 2,
  32: 2,
  33: 2,
  34: 3,
  35: 3,
  36: 3,
  37: 3,
  38: 3},
 'feature1': {0: 14,
  1: 11,
  2: 29,
  3: 26,
  4: 22,
  5: 20,
  6: 15,
  7: 30,
  8: 30,
  9: 16,
  10: 25,
  11: 26,
  12: 11,
  13: 11,
  14: 11,
  15: 29,
  16: 26,
  17: 22,
  18: 11,
  19: 15,
  20: 30,
  21: 30,
  22: 3,
  23: 25,
  24: 26,
  25: 11,
  26: 4,
  27: 11,
  28: 5,
  29: 45,
  30: 22,
  31: 20,
  32: 66,
  33: 30,
  34: 30,
  35: 78,
  36: 25,
  37: 26,
  38: 11},
 'feature2': {0: 28,
  1: 26,
  2: 19,
  3: 28,
  4: 27,
  5: 24,
  6: 14,
  7: 21,
  8: 15,
  9: 29,
  10: 23,
  11: 15,
  12: 11,
  13: 28,
  14: 26,
  15: 19,
  16: 28,
  17: 27,
  18: 24,
  19: 14,
  20: 21,
  21: 15,
  22: 29,
  23: 23,
  24: 15,
  25: 11,
  26: 28,
  27: 26,
  28: 19,
  29: 28,
  30: 27,
  31: 24,
  32: 14,
  33: 21,
  34: 15,
  35: 29,
  36: 23,
  37: 15,
  38: 11}})

【Comments】:

You also need to group by Department.
@IoaTzimas Thanks for pointing that out, see the edited solution above. (+1) by the way.