Python Pandas: group by a column and get the max, but exclude rows based on another column

【Question title】: Python Pandas group based on column and get max, but exclude based on another column
【Posted】: 2017-10-28 14:12:28
【Question】:

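I am working with some data and want to extract the maximum of one column, grouped by a different column. However, I want to exclude certain rows from the max calculation based on a third column.

Example: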

import pandas as pd

df = pd.DataFrame({'Col1':['A','A','A','B','B','B','B'],
                   'Col2':['Build','Plan','Other','Test','Build','Other','Buy'],
                   'Col3':[2,5,17,5,13,12,12]})

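I want the maximum of Col3, grouped by Col1, while excluding any row whose Col2 contains 'Other'. So the maximum of Col3 for 'A' should be 5, not 17.

I was able to get the maximum of Col3 grouped by Col1 with df['new'] = df.groupby(['Col1'])['Col3'].transform(max); however, this gives 17 for A.

After looking through other threads, I tried: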

x = df.groupby(['Col1'])
x2 = x.apply(lambda g: g[g['Col2'] != 'Other'])

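This seems to get me close (the data is grouped by Col1 and the rows are removed based on Col2). However, I can no longer find a way to get the maximum of Col3 per Col1.

At best, I have been able to use x2['Col3'].max() to get the maximum of Col3 after removing all rows with 'Other' in Col2. However, I cannot get the maximum of Col3 grouped by Col1.

Is there a relatively simple way to do this with built-in Pandas functions, rather than writing an entirely new custom function?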

【Comments】:

Filter your dataframe first, then group by Col1.
df.query('Col2 != "Other"').groupby('Col1')['Col3'].max()

Tags: python pandas

【Solution 1】:

You can try

df[df.Col2 != 'Other'].groupby('Col1').Col3.max()

Col1
A     5
B    13

To create a new column:

df['new']=df[df.Col2 != 'Other'].groupby('Col1').Col3.transform('max')
df['new'] = df.new.ffill()

    Col1    Col2    Col3    new
0   A       Build   2       5.0
1   A       Plan    5       5.0
2   A       Other   17      5.0
3   B       Test    5       13.0
4   B       Build   13      13.0
5   B       Other   12      13.0
6   B       Buy     12      13.0

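Explanation: select only the rows of df where Col2 is not equal to 'Other', group by Col1, and take the maximum of Col3.

See the pandas documentation for transform: it returns a like-indexed object holding the transformed values rather than one aggregated value per group. Because the transform here runs on the filtered frame, the excluded rows get NaN when the result is assigned back, which is why the ffill step follows.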
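As a minimal sketch of an alternative to the ffill step (not part of the original answer), the per-group maximum can be computed once on the filtered rows and then looked up for every row with Series.map. The names kept, partial and group_max are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Col2': ['Build', 'Plan', 'Other', 'Test', 'Build', 'Other', 'Buy'],
                   'Col3': [2, 5, 17, 5, 13, 12, 12]})

# Rows that take part in the max calculation.
kept = df[df['Col2'] != 'Other']

# transform is indexed like the filtered frame, so labels 2 and 5 are absent.
partial = kept.groupby('Col1')['Col3'].transform('max')
print(partial.index.tolist())

# One value per group, indexed by Col1.
group_max = kept.groupby('Col1')['Col3'].max()

# Look up each row's group value; this does not depend on row order.
df['new'] = df['Col1'].map(group_max)
print(df)
```

This is also one way to get the column asked for in the comments below: group_max is indexed by the unique Col1 values, so mapping Col1 through it gives every row its group's value.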

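【Comments】:

@Scott Boston, in fact I wondered whether I should post at all, since you had already answered in the comments :)
@A-Za-z: Never let someone who should know better and abuses the comment section stop you from posting. ;-)
Awesome, thanks! Both of your answers work :) Do you know how to create a new column that holds the computed max, looked up by the value in Col1? Essentially: df = pd.DataFrame({'Col1':['A','A','A','B','B','B','B'],'Col2':['Build','Plan','Other','Test','Build','Other','Buy'], 'Col3':[2,5,17,5,13,12,12], 'new':[5,5,5,13,13,13,13]}) I tried df['new'] = a[(df['Col1'])], but that does not seem to work. I get the following error: cannot reindex from a duplicate axis
Thanks!! Works perfectly :)
@pyman, I have added some explanation. If it helped, please also accept the answer. Thanks :)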

【Solution 2】:

Another way, using a hybrid groupby key

df.groupby([df.Col2.ne('Other'), 'Col1']).Col3.max()[True]

Col1
A     5
B    13
Name: Col3, dtype: int64
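To see what the one-liner does, here is a small sketch that keeps the intermediate result. Grouping by the boolean Series and Col1 together gives a Series with a two-level index, and the True slice holds the rows where Col2 is not 'Other'. The name keep and the use of xs instead of [True] are choices made for this illustration, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Col2': ['Build', 'Plan', 'Other', 'Test', 'Build', 'Other', 'Buy'],
                   'Col3': [2, 5, 17, 5, 13, 12, 12]})

# Boolean key: True for rows to keep, False for the 'Other' rows.
keep = df['Col2'].ne('Other').rename('keep')

# Max of Col3 for every (keep, Col1) combination.
full = df.groupby([keep, 'Col1'])['Col3'].max()
print(full)

# Select the groups where keep is True, dropping that index level.
result = full.xs(True, level='keep')
print(result)
```

The False slice of full still contains the maxima of the excluded rows, so nothing is filtered before grouping; the selection happens afterwards.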

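【Comments】:

【Solution 3】:

@Vaishali's answer is a good start, but I think applying ffill to get rid of the NaN values can cause problems. For that method to work, the dataframe has to be sorted in a particular way. To convince yourself, try this: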

df = pd.DataFrame({'Col1':['A','A','A','B','B','B','B',"C", "C"],
               'Col2':['Build','Plan','Other','Test','Build','Other','Buy', "Buy","Other"],
               'Col3':[2,5,17,5,13,12,12,14,5]})
df = df.sample(frac=1) #shuffle rows

df['new']=df[df.Col2 != 'Other'].groupby('Col1')["Col3"].transform('max')
df['new'] = df.new.ffill()

You will get this incorrect result:

    Col1    Col2    Col3    new
3   B       Test    5       13.0
7   C       Buy     14      14.0
6   B       Buy     12      13.0
1   A       Plan    5       5.0
0   A       Build   2       5.0
5   B       Other   12      5.0
8   C       Other   5       5.0
4   B       Build   13      13.0
2   A       Other   17      13.0
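Since df.sample gives a different order on every run, here is a sketch that reproduces the table above deterministically by applying the same row order by hand, then lists the rows where the forward-filled value comes from the wrong group. The fixed permutation and the names shuffled, kept, expected and wrong are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'Col2': ['Build', 'Plan', 'Other', 'Test', 'Build', 'Other', 'Buy', 'Buy', 'Other'],
                   'Col3': [2, 5, 17, 5, 13, 12, 12, 14, 5]})

# Same row order as the output shown above, instead of a random shuffle.
shuffled = df.loc[[3, 7, 6, 1, 0, 5, 8, 4, 2]].copy()

# The approach under discussion: transform on the filtered rows, then ffill.
kept = shuffled[shuffled['Col2'] != 'Other']
shuffled['new'] = kept.groupby('Col1')['Col3'].transform('max')
shuffled['new'] = shuffled['new'].ffill()

# Reference values: per-group max of the kept rows, looked up by Col1.
expected = shuffled['Col1'].map(kept.groupby('Col1')['Col3'].max())

# Rows whose forward-filled value was taken from a different group.
wrong = shuffled[shuffled['new'] != expected]
print(wrong)
```

Only the 'Other' rows are affected: each one inherits the value of whatever row precedes it, which matches its own group only when the frame is ordered by Col1.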

A better solution: first define this function.

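def new_transform(df, exclude_cond, gbycol, target, agg_fun, ignore_value=None):
    # Copy the target column into a temporary column so the original values stay untouched.
    df['target_temp'] = df[target]
    # exclude_cond is a string evaluated here, so it must refer to the frame as `df`.
    df.loc[eval(exclude_cond), 'target_temp'] = ignore_value
    # Aggregate per group and broadcast the result back to every row.
    tmp = df.groupby(gbycol)['target_temp'].transform(agg_fun)
    # Remove the temporary column again.
    df.drop('target_temp', axis=1, inplace=True)
    return tmp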

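It takes your dataframe, your exclude_cond as a string, your groupby column(s) as a string or a list of strings, target (the name of the column to compute on), the aggregation function, and the value that the aggregation function ignores (None or NaN for the main agg functions). Because the condition string is evaluated inside the function, it has to refer to the dataframe by the name df.

Example: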

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1':['A','A','A','B','B','B','B',"C", "C"],
                   'Col2':['Build','Plan','Other','Test','Build','Other','Buy', "Buy","Other"],
                   'Col3':[2,5,17,5,13,12,12,14,5]})
df = df.sample(frac=1)
df['new']=new_transform(df, "df['Col2']=='Build'", ['Col1'],'Col3', 'sum', np.nan)

We get the correct calculation (here the sum of Col3 per Col1, ignoring the 'Build' rows):

    Col1    Col2    Col3    new
3   B       Test    5       29.0
2   A       Other   17      22.0
4   B       Build   13      29.0
6   B       Buy     12      29.0
7   C       Buy     14      19.0
1   A       Plan    5       22.0
5   B       Other   12      29.0
0   A       Build   2       22.0
8   C       Other   5       19.0
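For readers who would rather not pass the condition as a string to eval, the same idea can be sketched with a boolean mask and Series.where. The helper name masked_group_transform and its signature are hypothetical, and unlike the function above this variant does not add or drop a temporary column on the input frame. It relies on the aggregation skipping NaN, which sum, max, min and mean do in a groupby.

```python
import pandas as pd

def masked_group_transform(df, exclude_mask, by, target, agg):
    """Hypothetical helper: group-wise transform that ignores masked rows."""
    # Replace excluded values with NaN; the input frame itself is not modified.
    masked = df[target].where(~exclude_mask)
    # Group the masked values by the key columns and broadcast the aggregate.
    return masked.groupby([df[col] for col in by]).transform(agg)

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'Col2': ['Build', 'Plan', 'Other', 'Test', 'Build', 'Other', 'Buy', 'Buy', 'Other'],
                   'Col3': [2, 5, 17, 5, 13, 12, 12, 14, 5]})
df = df.sample(frac=1, random_state=0)  # shuffle rows; the seed is arbitrary

# Same calculation as the example above: sum of Col3 per Col1, ignoring 'Build'.
df['new'] = masked_group_transform(df, df['Col2'].eq('Build'), ['Col1'], 'Col3', 'sum')
print(df.sort_index())
```

The result is aligned on the index, so it can be assigned back regardless of row order.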

【Comments】: