Groupby aggregation based on multiple conditions in pandas: answers

【Question title】: Groupby aggregate based on multiple condition pandas
【Posted】: 2020-02-11 21:20:24
【Question description】:

I have a DataFrame like the one shown below:

Sector    Plot    Year       Amount   Month
SE1       1       2017       10       Sep
SE1       1       2018       10       Oct
SE1       1       2019       10       Jun
SE1       1       2020       90       Feb
SE1       2       2018       50       Jan
SE1       2       2017       100      May
SE1       2       2018       30       Oct
SE2       2       2018       50       Mar
SE2       2       2019       100      Jan

From the above, I want to prepare the following DataFrame, with one row per Sector and Plot:

Sector    Plot      Number_of_Times    Mean_Amount    Recent_Amount   Recent_year  Recent_Month    
SE1       1         4                  30             90              2020         Feb   
SE1       2         3                  60             30              2018         Oct
SE2       2         2                  75             100             2019         Jan

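【Question comments】:

Please consider adding a brief explanation of the expected output and the logic behind it.
Does this answer your question? Apply multiple functions to multiple groupby columns
Please provide additional information explaining the basis on which you expect this output.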

Tags: pandas pandas-groupby

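【Solution 1】:

If the rows in the input data are already sorted so that the most recent record comes last within each group, use GroupBy.agg with named aggregation: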

df1 = (df.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
                                         Mean_Amount=('Amount','mean'),
                                         Recent_Amount=('Amount','last'),
                                         Recent_year=('Year','last'),
                                         Recent_Month=('Month','last')).reset_index())
print(df1)
  Sector  Plot  Number_of_Times  Mean_Amount  Recent_Amount  Recent_year  \
0    SE1     1                4           30             90         2020   
1    SE1     2                3           60             30         2018   
2    SE2     2                2           75            100         2019   

  Recent_Month  
0          Feb  
1          Oct  
2          Jan
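
Whether 'last' really picks the most recent record depends on the row order within each group. As a rough check, the sketch below rebuilds the sample data from the question and tests whether each (Sector, Plot) group is already in chronological order. The helper names `months`, `month_num`, `key`, and `sorted_within_group` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

# Hypothetical helper: map month abbreviations to 1..12.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_num = {m: i for i, m in enumerate(months, start=1)}

# Sortable integer key such as 201709 for Sep 2017.
key = df['Year'] * 100 + df['Month'].map(month_num)

# True where the rows of a group are already in chronological order.
sorted_within_group = (key.groupby([df['Sector'], df['Plot']])
                          .apply(lambda s: s.is_monotonic_increasing))
print(sorted_within_group)
```

On the sample data the (SE1, 2) group is not in chronological order (2018 Jan precedes 2017 May), so the unsorted version returns the right values only because the latest row of that group happens to come last.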

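If the rows need sorting first, convert Month to datetime so that it sorts chronologically, add DataFrame.sort_values, apply the solution, and finally convert the month back to a string:

# Parse abbreviations such as 'Sep' into datetimes (the year defaults to 1900)
df['Month'] = pd.to_datetime(df['Month'], format='%b')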

df1 = (df.sort_values(['Sector','Plot','Year','Month'])
         .groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
                                         Mean_Amount=('Amount','mean'),
                                         Recent_Amount=('Amount','last'),
                                         Recent_year=('Year','last'),
                                         Recent_Month=('Month','last')).reset_index())
# Convert the datetime back to a month abbreviation such as 'Feb'
df1['Recent_Month'] = df1['Recent_Month'].dt.strftime('%b')
print(df1)
  Sector  Plot  Number_of_Times  Mean_Amount  Recent_Amount  Recent_year  \
0    SE1     1                4           30             90         2020   
1    SE1     2                3           60             30         2018   
2    SE2     2                2           75            100         2019   

  Recent_Month  
0          Feb  
1          Oct  
2          Jan
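
If overwriting the Month column is undesirable, one alternative is to locate the most recent row of each group with idxmax on a sortable key and merge it with the group statistics. This is a sketch of a different technique from the sort-then-'last' approach above; `month_num`, `key`, `latest`, `recent`, and `stats` are names introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_num = {m: i for i, m in enumerate(months, start=1)}

# Sortable integer key such as 201810 for Oct 2018.
key = df['Year'] * 100 + df['Month'].map(month_num)

# Index label of the most recent row within each (Sector, Plot) group.
latest = key.groupby([df['Sector'], df['Plot']]).idxmax()

# Pull the most recent rows and rename their columns for the output.
recent = (df.loc[latest.to_numpy(), ['Sector', 'Plot', 'Amount', 'Year', 'Month']]
            .rename(columns={'Amount': 'Recent_Amount',
                             'Year': 'Recent_year',
                             'Month': 'Recent_Month'}))

# Count and mean per group, independent of row order.
stats = (df.groupby(['Sector', 'Plot'])
           .agg(Number_of_Times=('Year', 'size'),
                Mean_Amount=('Amount', 'mean'))
           .reset_index())

df1 = stats.merge(recent, on=['Sector', 'Plot'])
print(df1)
```

Because the original frame is neither sorted nor modified, Month stays a plain string column throughout.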

Another idea is to make Month an ordered categorical, but this runs into a bug in pandas 0.25.1:

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month']  = pd.Categorical(df['Month'] , ordered=True, categories=months)

df1 = (df.sort_values(['Sector','Plot','Year','Month'])
         .groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
                                         Mean_Amount=('Amount','mean'),
                                         Recent_Amount=('Amount','last'),
                                         Recent_year=('Year','last'),
                                         Recent_Month=('Month','last')).reset_index())

print(df1)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
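
One possible way around the error is to keep Month as a plain string column and use the ordered categorical only as a temporary sort key, so that 'last' is never applied to a categorical column. This is a minimal sketch under that assumption; it has not been verified against pandas 0.25.1 specifically, and the `Month_order` column name is hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Helper column used only for sorting; Month itself stays a string column.
df_sorted = (df.assign(Month_order=pd.Categorical(df['Month'],
                                                  categories=months,
                                                  ordered=True))
               .sort_values(['Sector', 'Plot', 'Year', 'Month_order']))

# Aggregate only non-categorical columns, so 'last' never sees Month_order.
df1 = (df_sorted.groupby(['Sector', 'Plot'])
                .agg(Number_of_Times=('Year', 'size'),
                     Mean_Amount=('Amount', 'mean'),
                     Recent_Amount=('Amount', 'last'),
                     Recent_year=('Year', 'last'),
                     Recent_Month=('Month', 'last'))
                .reset_index())
print(df1)
```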

【Discussion】:

Can we get the corresponding Sector and Plot for each row in the output?