Cumulative sum sorted descending within a group (pandas): answers

【Question title】: Cumulative sum sorted descending within a group. Pandas
【Posted】: 2019-10-03 21:59:14
【Question】:

I am having trouble applying sort_values() and cumsum() within a group.

I have a dataset:

Basically, I need to sort the values within each group, compute cumulative sales, and select the rows that make up 90% of sales.

First, to get the cumulative sales per region,

and then to select just the rows covering 90% of sales within each region.

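I tried the following, but the last line does not work. It returns the error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method

I have also tried apply..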

import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
           'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
    'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})
# share of each row in its region's total sales
df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
# this is the line that fails: a SeriesGroupBy object has no sort_values()
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()

Thanks for any suggestions.

【Question discussion】:

Tags: python pandas

【Solution 1】:

You can definitely sort the dataframe first and then do the groupby():

df.sort_values(['region','sales'], ascending=[True,False],inplace=True)

df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')

df['cummul'] = df.groupby('region')['%'].cumsum()

# filter
df[df['cummul'].le(0.9)]

Output:

      id  region  sales         %    cummul
5   id_6       1     98  0.216336  0.216336
4   id_5       1     78  0.172185  0.388521
6   id_7       1     76  0.167770  0.556291
3   id_4       1     56  0.123620  0.679912
0   id_1       1     54  0.119205  0.799117
1   id_2       1     34  0.075055  0.874172
9   id_2       2     89  0.204598  0.204598
10  id_3       2     76  0.174713  0.379310
14  id_7       2     56  0.128736  0.508046
11  id_4       2     54  0.124138  0.632184
15  id_8       2     54  0.124138  0.756322
13  id_6       2     45  0.103448  0.859770
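A non-mutating variant of the same idea, as a rough sketch: sort a temporary copy, take the grouped cumsum, and let pandas align the result back onto the original index. The column name `share` and the `kind='stable'` argument are my own choices here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Sort a temporary copy by sales (descending). groupby keeps that row order
# inside each group, so cumsum runs from the largest sale downwards.
# The result still carries the original index, so the assignment aligns
# each cumulative value with its own row in the unsorted df.
df['cumul'] = (
    df.sort_values('sales', ascending=False, kind='stable')
      .groupby('region')['sales']
      .cumsum()
)

# Cumulative share of each region's total sales ('share' is an assumed name).
df['share'] = df['cumul'] / df.groupby('region')['sales'].transform('sum')

# Rows whose cumulative share stays within 90% of their region's sales.
top = df[df['share'].le(0.9)].sort_values(['region', 'sales'],
                                          ascending=[True, False])
print(top)
```

This leaves `df` in its original row order, which can be handy if other code relies on it.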

【Discussion】:

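Nice one. By the way, have you had the problem of Stack Overflow automatically logging you off? It just happened to me a few minutes ago.
Thanks @Quang Hoang, I didn't think sorting could be done on two variables. It turns out everything is easier than I first thought :)
@WeNYoBen No, that hasn't happened to me.
@Quang Hoang, is it possible to sort by multiple columns? I am trying to do something with a third variable, but it doesn't work..
Yes, you can use sort_values([col1,col2,col3,col4...]) and pass ascending=[True, False,...] with the same length as the list of columns.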
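For the three-column case asked about in the comments, a minimal sketch on the question's data. Using `id` as the third key is only an assumption for illustration, since the thread does not say what the third variable was.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# One ascending flag per sort column, in the same order as the column list:
# region ascending, sales descending, and id ascending to break ties
# between rows with equal sales.
out = df.sort_values(['region', 'sales', 'id'], ascending=[True, False, True])
print(out)
```

If the `ascending` list and the column list differ in length, pandas raises an error, so the two need to be kept in step.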

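【Solution 2】:

First we create the % column using your logic, but we multiply by 100 and round to whole numbers.

Then we sort by region and %; no groupby is needed for that.

After sorting, we create the cumul column.

Finally, we use query to select the rows within the 90% range: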

df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()

df.query('cumul.le(90)')

Output

      id  region  sales     %  cumul
5   id_6       1     98  22.0   22.0
4   id_5       1     78  17.0   39.0
6   id_7       1     76  17.0   56.0
0   id_1       1     54  12.0   68.0
3   id_4       1     56  12.0   80.0
1   id_2       1     34   8.0   88.0
9   id_2       2     89  20.0   20.0
10  id_3       2     76  17.0   37.0
14  id_7       2     56  13.0   50.0
11  id_4       2     54  12.0   62.0
15  id_8       2     54  12.0   74.0
13  id_6       2     45  10.0   84.0
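One thing to watch with this approach: rounding each percentage before the cumulative sum lets rounding error accumulate, and it also blurs the order of near-ties (id_1 and id_4 both show 12.0 above). A sketch that keeps full precision and rounds only for display; the column names `pct` and `cumul_pct` are my own, and sorting on `sales` rather than on the rounded `%` is a deliberate swap.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Unrounded percentage of each region's total sales.
df['pct'] = df['sales'] / df.groupby('region')['sales'].transform('sum') * 100

# Sort on the raw sales so that rounding cannot change the order.
df = df.sort_values(['region', 'sales'], ascending=[True, False])

# Running percentage within each region, still at full precision.
df['cumul_pct'] = df.groupby('region')['pct'].cumsum()

# A plain comparison inside query() avoids calling a method in the expression.
selected = df.query('cumul_pct <= 90')

# Round only when displaying.
print(selected.round(1))
```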

【Discussion】:

【Solution 3】:

If you only need the sales data without the percentages, this can be done easily with method chaining:

(
  df
  .sort_values(by='sales', ascending=False)
  .groupby('region')
  .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
  .reset_index(level=0, drop=True)
)

Output

      id  region  sales
5   id_6       1     98
4   id_5       1     78
6   id_7       1     76
3   id_4       1     56
0   id_1       1     54
1   id_2       1     34
7   id_8       1     34
9   id_2       2     89
10  id_3       2     76
14  id_7       2     56
11  id_4       2     54
15  id_8       2     54
13  id_6       2     45
12  id_5       2     34

This works because taking all values above the 10% quantile is essentially the same as taking the top 90% of values.
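It may be worth checking this on your own data: a quantile filter keeps rows by rank, which is not quite the same as keeping rows by cumulative share of sales. On the sample data the two give different row counts (7 versus 6 per region, matching the outputs shown in this thread). A sketch of the comparison; the names `threshold`, `by_quantile` and `by_share` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(1, 9)] * 2,
    'region': [1] * 8 + [2] * 8,
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Rank-based filter: keep rows above their region's 10% quantile.
q10 = df.groupby('region')['sales'].quantile(0.1)  # one value per region
threshold = df['region'].map(q10)                   # broadcast back to rows
by_quantile = df[df['sales'] > threshold]

# Share-based filter: keep rows whose cumulative sales stay within 90%
# of the region total, counting from the largest sale downwards.
ordered = df.sort_values(['region', 'sales'], ascending=[True, False])
grouped = ordered.groupby('region')['sales']
share = grouped.cumsum() / grouped.transform('sum')
by_share = ordered[share.le(0.9)]

print(by_quantile.groupby('region').size())
print(by_share.groupby('region').size())
```

Which of the two is wanted depends on whether "90%" means 90% of the rows or 90% of the sales total.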

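【Discussion】:

Thanks @shwanky! I hadn't tried sorting first and then grouping, because I thought the whole column would get sorted. In fact, it works as expected. Thanks
@vero You may also want to consider the interpolation method of quantile, since the quantile value may fall between two of your data points.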
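To illustrate the interpolation point, a small sketch using the region 1 sales from the question. `interpolation` is a real parameter of `Series.quantile`; the `results` dict is just a convenience of mine.

```python
import pandas as pd

# Region 1 sales from the question.
sales = pd.Series([54, 34, 23, 56, 78, 98, 76, 34])

results = {}
for method in ('linear', 'lower', 'higher'):
    # The 10% quantile falls between the two smallest values (23 and 34);
    # the interpolation method decides which value is reported.
    q = sales.quantile(0.1, interpolation=method)
    # Number of rows a strict "greater than" filter would keep.
    kept = int((sales > q).sum())
    results[method] = (q, kept)
    print(method, q, kept)
```

With `'higher'` the threshold lands exactly on 34, so a strict `>` filter drops both rows with sales of 34, whereas the default `'linear'` keeps them.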