Statistics of dataframe grouped by two elements: answers

【Question title】: Statistics of dataframe grouped by two elements
【Posted】: 2016-12-07 17:53:34
【Question】:

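To compute statistics for the groups of a pandas dataframe I found Chris Albon's explanation, and I would like to apply it to a dataframe grouped by two elements ('a' and 'b' in this MWE).

So here is a function that computes some group statistics: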

def get_group_stats(group):
    return {'count': group.count().add_prefix('count_'),
            'mean': group.mean().add_prefix('mean_'),
            'sum': group.sum().add_prefix('sum_')}

The definition of the dataframe df:

import pandas as pd

df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
                    'b':['A','A','B','A','B','A'],
                    'c':[ 1, 2, 5, 5, 4, 6 ]})

Then create the statistics table grouped by 'a' and 'b':

s1 = df.groupby(['a', 'b']).apply(get_group_stats)

But the suggested unstack() function does not combine the dataframes correctly. What I want:

    a    |    b    | count_c | mean_c  | sum_c
-------------------------------------------------
    A    |    A    |    2    |   1.5   |   3.0
    B    |    A    |    1    |   5.0   |   5.0
    B    |    B    |    2    |   4.5   |   9.0
    C    |    A    |    1    |   6.0   |   6.0

【Comments】:

Tags: python python-3.x pandas

【Solution 1】:

The function passed to apply needs to return a Series:

def get_group_stats(group):
    return pd.Series({'count': group.c.count(),
                      'mean': group.c.mean(),
                      'sum': group.c.sum()})


s1 = df.groupby(['a', 'b']).apply(get_group_stats).add_suffix('_c')
print (s1)
     count_c  mean_c  sum_c
a b                        
A A      2.0     1.5    3.0
B A      1.0     5.0    5.0
  B      2.0     4.5    9.0
C A      1.0     6.0    6.0

But it is better to aggregate with a list of functions:

s1 = df.groupby(['a', 'b'])['c'].agg(['count','mean','sum']).add_suffix('_c').reset_index()
print (s1)
   a  b  count_c  mean_c  sum_c
0  A  A        2     1.5      3
1  B  A        1     5.0      5
2  B  B        2     4.5      9
3  C  A        1     6.0      6
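
As a possible alternative to add_suffix, here is a minimal sketch using named aggregation, where each keyword becomes an output column name. It assumes pandas 0.25 or later, which is newer than this thread; the output names count_c, mean_c and sum_c are chosen only to match the table requested in the question.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'c': [1, 2, 5, 5, 4, 6]})

# Named aggregation (assumes pandas >= 0.25):
# keyword = output column name, value = (input column, aggregation function)
s1 = (df.groupby(['a', 'b'])
        .agg(count_c=('c', 'count'),
             mean_c=('c', 'mean'),
             sum_c=('c', 'sum'))
        .reset_index())
print(s1)
```

This names the columns in the same step that computes them, which can be easier to read when several input columns are aggregated at once.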

【Comments】:

【Solution 2】:

You can use DataFrameGroupBy.agg for this:

In [1]: df.groupby(['a', 'b'])['c'].agg(['count','mean','sum']).add_suffix('_c')

Out[1]: 
     count_c  mean_c  sum_c
a b                        
A A        2     1.5      3
B A        1     5.0      5
  B        2     4.5      9
C A        1     6.0      6

If you want a and b as columns rather than as the index, you can also chain reset_index().
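
The difference between the two layouts can be checked with a short sketch like the following. The variable names stats and flat are only illustrative and do not appear in the thread.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'c': [1, 2, 5, 5, 4, 6]})

# After groupby + agg, the grouping keys a and b form the (Multi)Index
stats = df.groupby(['a', 'b'])['c'].agg(['count', 'mean', 'sum']).add_suffix('_c')
print(list(stats.index.names))

# reset_index() moves the index levels back into ordinary columns
flat = stats.reset_index()
print(flat.columns.tolist())
```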

【Comments】:

Elegant solution! I added .reset_index()
@Api: @Jezrael added this agg solution to his answer right after I posted. You should accept his answer, it is more complete.