【Question title】: Pandas groupby multiple aggregations with exclusion of the focal group
【Posted】: 2020-12-22 13:23:00
【Question】:

I have a toy dataset that looks like this.

   Building  Department  feature1  feature2
0         A           1        14        28
1         A           1        11        26
2         A           1        29        19
3         A           2        26        28
4         A           2        22        27
5         A           2        20        24
6         A           2        15        14
7         A           2        30        21
8         A           3        30        15
9         A           3        16        29
10        A           3        25        23
11        A           3        26        15
12        A           3        11        11

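I want to compute these variables:

  1. For each building and department, Score1 is the mean of that department's feature1 and feature2 values (flattened and averaged, nothing fancy).
  2. For each building and department, Score2 is the mean of feature1 and feature2 excluding that department (i.e. the focal group).

So for Department 1, Score1 is computed from the values of Department 1, while Score2 is computed from Departments 2 and 3.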

Final result:

  Building  Department  Score1  Score2
0        A           1   21.16  21.400
1        A           2   22.70  20.500
2        A           3   20.10  22.125

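I could not find any pandas shortcut for this kind of "exclusion". One possible solution is to loop over the groups and compute it that way, but my data is too large for such a loop.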
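For reference, the loop-based baseline mentioned above might look like the following sketch. It assumes the toy data shown earlier; the helper name `scores_by_loop` is hypothetical and only serves to pin down what Score1 and Score2 mean, since it scans the frame once per group.

```python
import pandas as pd

# Toy data from the question (building A only).
df = pd.DataFrame({
    "Building": ["A"] * 13,
    "Department": [1] * 3 + [2] * 5 + [3] * 5,
    "feature1": [14, 11, 29, 26, 22, 20, 15, 30, 30, 16, 25, 26, 11],
    "feature2": [28, 26, 19, 28, 27, 24, 14, 21, 15, 29, 23, 15, 11],
})


def scores_by_loop(frame):
    """Hypothetical helper: one pass over the frame per (Building, Department)."""
    features = ["feature1", "feature2"]
    rows = []
    for (building, dept), grp in frame.groupby(["Building", "Department"]):
        # Rows of the same building that belong to any other department.
        others = frame[(frame["Building"] == building) & (frame["Department"] != dept)]
        rows.append({
            "Building": building,
            "Department": dept,
            # Flatten both feature columns, then average.
            "Score1": grp[features].to_numpy().mean(),
            "Score2": others[features].to_numpy().mean(),
        })
    return pd.DataFrame(rows)


baseline = scores_by_loop(df)
print(baseline)
```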

Any help or hints are appreciated! Thanks.

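【Question comments】:

  • Do the other buildings matter for Score2?
  • Score2 is also computed per building. So other buildings are not considered, only the focal building.

Tags: python pandas dataframe pandas-groupby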


【Solution 1】:

You can do it like this:

import pandas as pd

# stack to reshape the dataframe: one row per (Building, Department, feature) value
s = df.set_index(['Building', 'Department']).stack()

# sum and count per (Building, Department)
m1 = s.groupby(level=[0, 1]).agg(['sum', 'count'])
# per-building totals minus the focal group's own sum and count
m2 = s.groupby(level=0).agg(['sum', 'count']) - m1

# compute mean = sum / count and concatenate along axis=1
out = pd.concat([m1['sum'] / m1['count'], m2['sum'] / m2['count']],
                axis=1, keys=['Score1', 'Score2']).reset_index()

Details:

First set Building and Department as the index of the dataframe, then stack to reshape it so that the features are flattened:

# s
Building  Department          
A         1           feature1    14
                      feature2    28
                      feature1    11
                      feature2    26
                      feature1    29
                      feature2    19
          2           feature1    26
                      feature2    28
                      feature1    22
                      feature2    27
                      feature1    20
                      feature2    24
                      feature1    15
                      feature2    14
                      feature1    30
                      feature2    21
          3           feature1    30
                      feature2    15
                      feature1    16
                      feature2    29
                      feature1    25
                      feature2    23
                      feature1    26
                      feature2    15
                      feature1    11
                      feature2    11

Group the stacked data and aggregate with sum and count:

# m1 : sum and count per building and department
                     sum  count
Building Department            
A        1           127      6
         2           227     10
         3           201     10

# m2 : sum and count per building - m1
                     sum  count
Building Department            
A        1           428     20
         2           328     16
         3           354     16
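The subtraction that produces m2 can be checked by hand. The sketch below uses plain Python and only the sums and counts shown in m1 above for building A; it is an arithmetic check, not part of the answer's code.

```python
# (sum, count) per department of building A, copied from m1 above.
m1 = {1: (127, 6), 2: (227, 10), 3: (201, 10)}

# Building-level totals, i.e. what grouping on Building alone yields.
total_sum = sum(s for s, _ in m1.values())
total_count = sum(c for _, c in m1.values())

# Removing a department's own sum and count leaves the other departments.
m2 = {d: (total_sum - s, total_count - c) for d, (s, c) in m1.items()}

# Mean over "everyone else" in the building.
score2 = {d: s / c for d, (s, c) in m2.items()}

print(total_sum, total_count)
print(m2)
print(score2)
```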

Compute the means for m1 (Score1) and m2 (Score2) by dividing the sum column by count, then concatenate these means with pd.concat to get the desired result:

# out
  Building  Department     Score1  Score2
0        A           1  21.166667  21.400
1        A           2  22.700000  20.500
2        A           3  20.100000  22.125
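A self-contained variant of the same sum/count idea is sketched below. It is not the answer's exact code: it swaps stack for melt and uses a groupby transform to broadcast the per-building totals, and the function name `focal_scores` is hypothetical. A building with a single department would end up with a zero count for the rest and therefore NaN for Score2.

```python
import pandas as pd

# Toy data from the question (building A only).
df = pd.DataFrame({
    "Building": ["A"] * 13,
    "Department": [1] * 3 + [2] * 5 + [3] * 5,
    "feature1": [14, 11, 29, 26, 22, 20, 15, 30, 30, 16, 25, 26, 11],
    "feature2": [28, 26, 19, 28, 27, 24, 14, 21, 15, 29, 23, 15, 11],
})


def focal_scores(frame):
    """Hypothetical helper: Score1/Score2 without looping over groups."""
    keys = ["Building", "Department"]

    # Long format: one row per (Building, Department, feature value).
    long_df = frame.melt(id_vars=keys, value_vars=["feature1", "feature2"],
                         value_name="value")

    # Sum and count of the flattened values per (Building, Department).
    own = long_df.groupby(keys)["value"].agg(["sum", "count"])

    # Per-building totals, repeated on every department row of that building.
    total = own.groupby(level="Building").transform("sum")

    # Everything in the building except the focal department.
    rest = total - own

    return pd.DataFrame({
        "Score1": own["sum"] / own["count"],
        "Score2": rest["sum"] / rest["count"],
    }).reset_index()


out = focal_scores(df)
print(out)
```

Because the totals are computed per Building, the same function should carry over to data with several buildings.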

【Discussion】:

    【Solution 2】:

    Here is my suggestion:

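    # Row-wise average of the two features (each row contributes two values)
    df['feature12']=(df['feature1']+df['feature2'])/2
    
    pv=pd.pivot_table(df, index=['Building', 'Department'], values='feature12', aggfunc=['mean', 'count'])
    
    # Sum and count per (Building, Department), and the totals per Building
    grpsum=pv['mean']['feature12']*pv['count']['feature12']
    grplen=pv['count']['feature12']
    dfsum=grpsum.groupby(level='Building').transform('sum')
    dflen=grplen.groupby(level='Building').transform('sum')
    
    # Score2: mean over the rest of the building, excluding the focal department
    pv['Score2']=(dfsum-grpsum)/(dflen-grplen)
    pv['Score1']=pv['mean']['feature12']
    
    res=pv[['Score1', 'Score2']]
    
    # Flatten the column MultiIndex and move Building/Department back to columns
    res.columns=res.columns.get_level_values(0)
    res=res.reset_index(level=[0,1])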
    
    
    Output:
    
    >>> print(res)
      Building  Department     Score1  Score2
    0        A           1  21.166667  21.400
    1        A           2  22.700000  20.500
    2        A           3  20.100000  22.125
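This answer averages feature1 and feature2 per row before aggregating. That should agree with the flattened mean as long as every row contributes exactly two values, which assumes neither column has missing entries. A small check on Department 1 of the toy data:

```python
import pandas as pd

# Department 1 of building A from the toy data.
dept1 = pd.DataFrame({"feature1": [14, 11, 29], "feature2": [28, 26, 19]})

# Flatten both columns and average all six values.
flat_mean = dept1[["feature1", "feature2"]].to_numpy().mean()

# Average the two features per row first, then average the row means.
row_mean = ((dept1["feature1"] + dept1["feature2"]) / 2).mean()

print(flat_mean, row_mean)
```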
    

    【Discussion】:

      【Solution 3】:

      A two-step approach (which I believe is cleaner):

      # Step 1 (Score1, stored in column feature1): mean of the row-wise feature average per (Building, Department)
      agg_df = (df
                .assign(feature1 = lambda x: (x['feature1'] + x['feature2'])/2).reset_index(drop=True)
                .groupby(['Building', 'Department'])
                .feature1
                .mean()
                .reset_index())
      
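      # Step 2 (Score2, stored in column feature2): for each row of agg_df, the mean of the step-1 values over all rows with a different Department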
      agg_df['feature2'] = [*map(lambda i: (agg_df[~agg_df.Department.isin([agg_df.iloc[i].Department])]
                                 .feature1
                                 .mean()), 
                                 range(0, len(agg_df)))]
      

      where df is:

      df = pd.DataFrame({'Building': {0: 'A',
        1: 'A',
        2: 'A',
        3: 'A',
        4: 'A',
        5: 'A',
        6: 'A',
        7: 'A',
        8: 'A',
        9: 'A',
        10: 'A',
        11: 'A',
        12: 'A',
        13: 'B',
        14: 'B',
        15: 'B',
        16: 'B',
        17: 'B',
        18: 'B',
        19: 'B',
        20: 'B',
        21: 'B',
        22: 'B',
        23: 'B',
        24: 'B',
        25: 'B',
        26: 'C',
        27: 'C',
        28: 'C',
        29: 'C',
        30: 'C',
        31: 'C',
        32: 'C',
        33: 'C',
        34: 'C',
        35: 'C',
        36: 'C',
        37: 'C',
        38: 'C'},
       'Department': {0: 1,
        1: 1,
        2: 1,
        3: 2,
        4: 2,
        5: 2,
        6: 2,
        7: 2,
        8: 3,
        9: 3,
        10: 3,
        11: 3,
        12: 3,
        13: 1,
        14: 1,
        15: 1,
        16: 2,
        17: 2,
        18: 2,
        19: 2,
        20: 2,
        21: 3,
        22: 3,
        23: 3,
        24: 3,
        25: 3,
        26: 1,
        27: 1,
        28: 1,
        29: 2,
        30: 2,
        31: 2,
        32: 2,
        33: 2,
        34: 3,
        35: 3,
        36: 3,
        37: 3,
        38: 3},
       'feature1': {0: 14,
        1: 11,
        2: 29,
        3: 26,
        4: 22,
        5: 20,
        6: 15,
        7: 30,
        8: 30,
        9: 16,
        10: 25,
        11: 26,
        12: 11,
        13: 11,
        14: 11,
        15: 29,
        16: 26,
        17: 22,
        18: 11,
        19: 15,
        20: 30,
        21: 30,
        22: 3,
        23: 25,
        24: 26,
        25: 11,
        26: 4,
        27: 11,
        28: 5,
        29: 45,
        30: 22,
        31: 20,
        32: 66,
        33: 30,
        34: 30,
        35: 78,
        36: 25,
        37: 26,
        38: 11},
       'feature2': {0: 28,
        1: 26,
        2: 19,
        3: 28,
        4: 27,
        5: 24,
        6: 14,
        7: 21,
        8: 15,
        9: 29,
        10: 23,
        11: 15,
        12: 11,
        13: 28,
        14: 26,
        15: 19,
        16: 28,
        17: 27,
        18: 24,
        19: 14,
        20: 21,
        21: 15,
        22: 29,
        23: 23,
        24: 15,
        25: 11,
        26: 28,
        27: 26,
        28: 19,
        29: 28,
        30: 27,
        31: 24,
        32: 14,
        33: 21,
        34: 15,
        35: 29,
        36: 23,
        37: 15,
        38: 11}})
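One thing worth checking with this approach: the second step averages the other groups' means, which matches the pooled mean of their values only when those groups have the same number of rows. The plain-Python sketch below uses the sums and counts of building A from the toy data, with Department 2 as the focal group; the variable names are illustrative.

```python
# (sum, count) of the flattened feature values per department of building A.
groups = {1: (127, 6), 2: (227, 10), 3: (201, 10)}

focal = 2
others = [v for d, v in groups.items() if d != focal]

# Unweighted mean of the other departments' means.
mean_of_means = sum(s / c for s, c in others) / len(others)

# Pooled mean over all values of the other departments.
pooled = sum(s for s, _ in others) / sum(c for _, c in others)

print(mean_of_means, pooled)
```

For Department 2 the two numbers differ because Departments 1 and 3 have different sizes, and the pooled value is the one that matches the expected Score2 in the question.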
      

      【Discussion】:

      • You also need to group by department.
      • @IoaTzimas Thanks for pointing that out, see the edited solution above. (+1) by the way.