Getting a count for each category in a Pandas DataFrame: answers

【Question title】: Getting a count for each category in a Pandas DataFrame
【Posted】: 2022-01-06 22:20:07
【Question】:

I have a DataFrame similar to the following:

import pandas as pd
data = {"Name":["Andrew","Andrew","Andrew","Andrew","Andrew","Andrew","Andrew", "Sam", "Sam", "Sam", "Sam", "Sam"], "PASS":[0, 1, 1, 0, 1, 1, 1, 0, 1, 1,0,1]}
df = pd.DataFrame(data=data)

Output:

    Name    PASS
0   Andrew  0
1   Andrew  1
2   Andrew  1
3   Andrew  0
4   Andrew  1
5   Andrew  1
6   Andrew  1
7   Sam     0
8   Sam     1
9   Sam     1
10  Sam     0
11  Sam     1

I want to produce a DataFrame containing each student's longest run of consecutive passes:

    Name    MAX_PASS
0   Andrew  3
1   Sam     2

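I need some help modifying the code I currently have. count is outputting 0110110110 and result = 2, which is not quite right. I think I'm close, but I need some help to get over the finish line. Thanks.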

count = ''
for i in range(len(df)-1):
    if df.Name[i] == df.Name[i+1]:
        if df.PASS[i] == 0:
            count += "0"
        else:
            count += "1"  
            result = len(max(count.split('0')))
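Tracing the loop above helps explain the reported output. The test `df.Name[i] == df.Name[i+1]` skips the last row of each name (rows 6 and 11), `count` is never reset between names, and `max` on a list of strings compares them lexicographically rather than by length. A minimal sketch of a per-name version follows, assuming the same `Name` and `PASS` columns; the helper name `longest_run` is hypothetical, not something from the original post.

```python
import pandas as pd

data = {"Name": ["Andrew"] * 7 + ["Sam"] * 5,
        "PASS": [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(data)

def longest_run(values):
    """Length of the longest run of consecutive 1s in an iterable of 0/1 (hypothetical helper)."""
    best = current = 0
    for v in values:
        current = current + 1 if v == 1 else 0  # reset the streak on every 0
        best = max(best, current)
    return best

# One pass per name, so a streak never spills from one student into the next.
result = {name: longest_run(group["PASS"]) for name, group in df.groupby("Name")}
print(result)
```

On the sample data this gives 3 for Andrew and 2 for Sam, matching the desired output.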

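【Comments】:

Use groupby and agg: df.groupby('Name').agg('sum'). Here is the documentation: pandas.pydata.org/pandas-docs/stable/reference/api/…
FYI: Answering questions thoroughly is time-consuming. If your question is solved, please say thank you by accepting the solution that best fits your needs. The accept check mark is below the up/down arrows at the top left of the answer. If a better solution comes along, you can accept the new one instead. You can also vote on the quality/helpfulness of an answer with the up or down arrows. If a solution does not answer the question, leave a comment. See: What should I do when someone answers my question? Thank you.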

Tags: python pandas

【Solution 1】:

You could consider adapting this answer:

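def max_strike_group(x, col):
    """Return the longest run of non-zero values in column `col` of group `x`."""
    x = x[col]
    a = x != 0  # True for every pass
    # Running count of passes minus its value at the most recent fail
    out = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return out.max()

df.groupby("Name").apply(lambda x: max_strike_group(x, "PASS"))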

Name
Andrew    3
Sam       2
dtype: int64
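The one-line expression inside `max_strike_group` may be easier to follow when split into steps. The sketch below applies it to Andrew's values only; the intermediate names `running`, `baseline` and `streak` are illustrative and do not appear in the answer.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 0, 1, 1, 1])   # Andrew's PASS values
a = s != 0                              # True where the row is a pass

running = a.cumsum()                    # 0 1 2 2 3 4 5
# Running total as it stood at the most recent fail, carried forward
baseline = running.where(~a).ffill().fillna(0).astype(int)  # 0 0 0 2 2 2 2
streak = running - baseline             # 0 1 2 0 1 2 3

print(streak.tolist())
print(streak.max())
```

Subtracting the carried-forward baseline restarts the count at every fail, so the maximum of `streak` is the longest run of passes.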

【Discussion】:

【Solution 2】:

One option is to call cumsum twice: first to add up the 0s and 1s, then again to get the values after each reset:

TL;DR:

import numpy as np

cum1 = df.groupby('Name').PASS.cumsum()
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64

Explanation:

# first cumulative sum
cum1 = df.groupby('Name').PASS.cumsum()
cum1
0     0
1     1
2     2
3     2
4     3
5     4
6     5
7     0
8     1
9     2
10    2
11    3
Name: PASS, dtype: int64

# look for rows where the reset should occur
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
cum1
array([ 0,  1,  1, -2,  1,  1,  1,  0,  1,  1, -2,  1])

# build the max_pass column
# with the second cumsum and groupby
# before grouping again to get the max
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64
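One caveat, offered as an observation rather than as part of the original answer: the `np.where` step subtracts the whole running total at every 0, so two 0s in a row subtract it twice, and `cum1.shift()` also compares across the boundary between names. Neither case occurs in the sample data, but for PASS values 1, 0, 0, 1, 1, 1 the second cumsum peaks at 2 rather than 3. Below is a sketch of a variant that labels each run and counts within it, assuming the same `Name` and `PASS` columns; the frame `df2` and the names `run_id`, `streak` and `max_pass` are illustrative.

```python
import pandas as pd

# Illustrative data with two consecutive fails for "A"
df2 = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "A", "A", "B", "B"],
    "PASS": [1, 0, 0, 1, 1, 1, 1, 1],
})

is_fail = df2["PASS"].eq(0).astype(int)
# Every fail starts a new run; a per-name cumsum turns that into a run label.
run_id = is_fail.groupby(df2["Name"]).cumsum().rename("run_id")

# Count the passes inside each (name, run) pair, then keep the best run per name.
streak = df2.groupby([df2["Name"], run_id])["PASS"].sum()
max_pass = streak.groupby(level=0).max()

print(max_pass)
```

Because the run label is computed within each name, a streak cannot carry over from one student to the next.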

【Discussion】: