Python pandas groupby aggregate on multiple columns, then pivot

【Question title】: Python pandas groupby aggregate on multiple columns, then pivot
【Posted】: 2017-08-27 15:09:15
【Question】:

In Python, I have a pandas DataFrame similar to the following:

Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45    | 50    | 53    | Clothes
TV   | 200   | 300   | 250   | Technology
Book | 20    | 17    | 21    | Books
phone| 300   | 350   | 400   | Technology

Here shop1, shop2 and shop3 are the costs of each item in different shops. Now, after some data cleaning, I need to return a DataFrame like this one:

Category (index)| size| sum| mean | std
----------------------------------------

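Here size is the number of items in each Category, and sum, mean and std are those same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply, ...)?

Can someone help me out? I'm going crazy with this one... Thank you!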

【Comments on the question】:

Tags: python pandas dataframe pivot data-cleaning

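【Solution 1】:

Edited for pandas 0.22+, since using a dictionary to rename columns in a groupby aggregation is deprecated.

We set up a very similar dictionary, where the keys of the dictionary specify our functions and the dictionary itself is used to rename the columns.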

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(rnm_cols.keys()).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
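
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same stack, groupby, agg, rename chain. The variable names prices and result are my own, and list(rnm_cols) is used in place of rnm_cols.keys() only as a cautious choice so that agg receives a plain list.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# Function names as keys, desired column labels as values.
rnm_cols = dict(size="Size", sum="Sum", mean="Mean", std="Std")

# Stack the three shop columns into one long Series indexed by
# (Category, Item, shop), so every price becomes one row.
prices = df.set_index(["Category", "Item"]).stack()

# Aggregate all prices per Category, then relabel the columns.
result = (
    prices.groupby(level="Category")
    .agg(list(rnm_cols))
    .rename(columns=rnm_cols)
)
print(result)
```

Note that Size here counts prices (items times shops), not items, which is why Technology shows 6.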

Option 1
Use agg

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6
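
On recent pandas versions the dict-of-new-names form above is, as far as I know, no longer accepted on a Series groupby (it was deprecated and later removed). A sketch of the same idea using keyword-style named aggregation, which I believe is available from pandas 0.25 onward, could look like this. The names prices and result are mine, not from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# One long Series of prices indexed by (Category, Item, shop).
prices = df.set_index(["Category", "Item"]).stack()

# Named aggregation on a Series groupby: keyword = output column label,
# value = aggregation function name.
result = prices.groupby(level=0).agg(Size="size", Sum="sum", Mean="mean", Std="std")
print(result)
```

The columns come out in the order the keywords are given, so no separate rename step is needed.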

Option 2
More for less
Use describe

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0
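
A caveat when trying this on a newer pandas: to my knowledge, since 0.20 groupby(...).describe() already returns the statistics as columns, so the trailing .unstack() would reshape the table back into a long Series. A sketch that produces the table shown above without unstack, under that assumption (prices and summary are my own names):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# One long Series of prices indexed by (Category, Item, shop).
prices = df.set_index(["Category", "Item"]).stack()

# describe() on the grouped Series: one row per Category,
# one column per statistic (count, mean, std, min, quartiles, max).
summary = prices.groupby(level=0).describe()
print(summary)
```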

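【Comments】:

Thanks for the answer @piRSquared. This doesn't work if we want to apply multiple functions to the same column through the dictionary. Is there a way to handle that?
@CanCeylan This uses groupby and agg on a pandas Series. It behaves differently for a DataFrame.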

【Solution 2】:

df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})

Or, if you want it across all shops:

df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
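
A runnable sketch of the per-shop variant, with one extra step that is my own addition rather than part of this answer: the dict-of-lists agg returns two-level column labels such as ('shop1', 'sum'), and joining the levels with an underscore gives flat names that are easier to select.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

stats = ["sum", "mean", "std"]
per_shop = df.groupby("Category").agg(
    {"Item": "size", "shop1": stats, "shop2": stats, "shop3": stats}
)

# Flatten the two-level column labels: ('shop1', 'sum') -> 'shop1_sum'.
per_shop.columns = ["_".join(col) for col in per_shop.columns]
print(per_shop)
```

With this sample data, Books and Clothes each contain a single item, so their per-shop std is NaN. Only the stacked, all-shops version gives a std for every category.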

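【Comments】:

【Solution 3】:

If I understand correctly, you want to compute the aggregate metrics across all shops together, not for each shop separately. To do that, you can first stack your DataFrame and then group by Category: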

stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})

This results in

           Price                             
           count   sum        mean        std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
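
The count above is the number of prices per category (items times shops). If size is instead meant literally as the number of items, as the question words it, one possible sketch is below. It swaps in melt for set_index/stack and uses tuple-style named aggregation (which I believe needs pandas 0.25 or later). The names long and result and the capitalized column labels are my own choices.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# Long format: one row per (Item, Category, Shop) with its Price.
long = df.melt(id_vars=["Item", "Category"], var_name="Shop", value_name="Price")

# Size counts distinct items; the other statistics use every price.
result = long.groupby("Category").agg(
    Size=("Item", "nunique"),
    Sum=("Price", "sum"),
    Mean=("Price", "mean"),
    Std=("Price", "std"),
)
print(result)
```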

【Comments】: