【Question title】: How to select row with max value in column from pandas groupby() groups?
【Posted】: 2021-12-08 19:56:26
【Question】:

I have a table like this:

import pandas as pd

df = pd.DataFrame(
        [
            ['john', 'rdgsdr', 2, 'A'],
            ['ann',  'dsdfds', 3, 'A'],
            ['john', 'jkfgdj', 1, 'B'],
            ['bob',  'xcxfcd', 5, 'A'],
            ['john', 'uityuu', 3, 'C'],
            ['ann',  'werwwe', 2, 'C'],
        ],
        columns=['name', 'stuff', 'orders', 'store']
    )

# df
#    name   stuff  orders store
# 0  john  rdgsdr       2     A
# 1   ann  dsdfds       3     A
# 2  john  jkfgdj       1     B
# 3   bob  xcxfcd       5     A
# 4  john  uityuu       3     C
# 5   ann  werwwe       2     C

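For each name, I need to extract the row with the maximum number of orders, and also compute the list of all stores for that name. Like this: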

grouped = df.groupby('name')

for name, group in grouped:
    print('-'*5, name, '-'*5)
    print(group)

# ----- ann -----
#   name   stuff  orders store
# 1  ann  dsdfds       3     A  <- max(orders) for ann
# 5  ann  werwwe       2     C
# ----- bob -----
#   name   stuff  orders store
# 3  bob  xcxfcd       5     A  <- max(orders) for bob
# ----- john -----
#    name   stuff  orders store
# 0  john  rdgsdr       2     A
# 2  john  jkfgdj       1     B
# 4  john  uityuu       3     C  <- max(orders) for john

# ##########################
# This is what I want to get
# ##########################
>>> result
   name   stuff  max orders  all stores
1  ann   dsdfds           3         A,C
3  bob   xcxfcd           5           A
4  john  uityuu           3       A,B,C

I tried:

result = grouped.agg(
        **{
            # 'stuff': 'stuff',
            'max orders': pd.NamedAgg('orders', max),
            'all stores': pd.NamedAgg('store', lambda s: s.str.join(',')),
        }
    )

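But I don't know how to include the 'stuff' column in the result (in my real application I have many such additional columns, possibly dozens). Also, the join gives me lists instead of strings: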

>>> result
   name  max orders all stores
0   ann           3     [A, C]
1   bob           5          A
2  john           3  [A, B, C]

【Comments】:

    Tags: python pandas pandas-groupby


    【Answer 1】:

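    Try `idxmax` on `orders` after setting `stuff` as the index, so that the aggregation returns the `stuff` value of the row with the largest `orders`: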

    out = df.set_index('stuff').groupby('name').agg(stuff = ('orders' , 'idxmax'),
                                              max_orders = ('orders' , 'max'),
                                              all_stores = ('store',','.join))#.reset_index()
    Out[200]: 
           stuff  max_orders all_stores
    name                               
    ann   dsdfds           3        A,C
    bob   xcxfcd           5          A
    john  uityuu           3      A,B,C
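
The same `idxmax` idea can be extended to any number of extra columns by applying it to the original row index and passing the result to `.loc`. The following is only a minimal sketch, assuming the default unique integer index from the question's example; the output column names `max orders` and `all stores` are taken from the question, and on ties `idxmax` picks the first matching row.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'rdgsdr', 2, 'A'],
        ['ann',  'dsdfds', 3, 'A'],
        ['john', 'jkfgdj', 1, 'B'],
        ['bob',  'xcxfcd', 5, 'A'],
        ['john', 'uityuu', 3, 'C'],
        ['ann',  'werwwe', 2, 'C'],
    ],
    columns=['name', 'stuff', 'orders', 'store'],
)

# Index label of the row with the largest 'orders' within each name
idx = df.groupby('name')['orders'].idxmax()

# Select those full rows, so every extra column comes along unchanged
result = (
    df.loc[idx]
      .drop(columns='store')
      .rename(columns={'orders': 'max orders'})
)

# Comma-separated stores per name, mapped back onto the selected rows
stores = df.groupby('name')['store'].agg(','.join)
result['all stores'] = result['name'].map(stores)

print(result)
```

Because whole rows are selected, no per-column aggregation has to be listed, which may suit the case with dozens of `stuff`-like columns.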
    

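    【Comments】:

    • It says TypeError: Must provide 'func' or tuples of '(column, aggfunc).
    • Also, I think I need to somehow tell it that I need the stuff value corresponding to max(orders), not just any (or the first) row in the group.
    • @Amenhotep check the update
    • As I said, I have dozens of "stuff" columns; that was just an example.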
    【Answer 2】:

    You can get the list of stores they have worked at by combining this answer with a groupby.

    # Get stores that each person works at
    stores_for_each_name = df.groupby('name')['store'].apply(','.join)
    
    # Get row with largest order value for each name
    df = df.sort_values('orders', ascending=False).drop_duplicates('name').rename({'orders': 'max_orders'}, axis=1)
    
    # Replace store column with comma-separated list of stores they have worked at
    df = df.drop('store', axis=1)
    df = df.join(stores_for_each_name, on='name')
    

    Output:

       name   stuff  max_orders  store
    3   bob  xcxfcd           5      A
    1   ann  dsdfds           3    A,C
    4  john  uityuu           3  A,B,C
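
The steps above can also be wrapped in a small function that works on any key, value, and joined column without overwriting the input. This is a sketch only: `top_row_per_group` and its parameter names are hypothetical, and `kind='mergesort'` is an added assumption intended to keep the original row order among equal `orders` values.

```python
import pandas as pd


def top_row_per_group(frame, key, value, joined):
    """Hypothetical helper: one row per `key` holding the largest `value`,
    with all `joined` entries of that group as a comma-separated string."""
    # Comma-separated list of the `joined` column for each key
    joined_per_key = frame.groupby(key)[joined].agg(','.join)
    # Largest value first, then keep the first row seen for each key
    top = (
        frame.sort_values(value, ascending=False, kind='mergesort')
             .drop_duplicates(key)
    )
    # Swap the per-row `joined` column for the per-group string
    return top.drop(columns=joined).join(joined_per_key, on=key)


df = pd.DataFrame(
    [
        ['john', 'rdgsdr', 2, 'A'],
        ['ann',  'dsdfds', 3, 'A'],
        ['john', 'jkfgdj', 1, 'B'],
        ['bob',  'xcxfcd', 5, 'A'],
        ['john', 'uityuu', 3, 'C'],
        ['ann',  'werwwe', 2, 'C'],
    ],
    columns=['name', 'stuff', 'orders', 'store'],
)

result = top_row_per_group(df, key='name', value='orders', joined='store')
print(result)
```

Unlike the snippet above, this leaves `df` itself unchanged, so the original rows remain available afterwards.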
    

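    【Comments】:

    • Thanks, that worked.
    • Also, I think your solution is very straightforward and efficient, since you don't need to recompute anything. Congratulations!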