【Question title】: Pandas groupby: Count the number of occurrences within a time range for each group
【Posted】: 2015-12-13 05:50:32
【Question】:

I have a DataFrame:

ID  DATE       WIN
A   2015/6/5   Yes
A   2015/6/7   Yes
A   2015/6/7   Yes
A   2015/6/7   Yes
B   2015/6/8   No
B   2015/8/7   Yes
C   2015/5/15  Yes
C   2015/5/30  No
C   2015/7/30  No
C   2015/8/03  Yes

I want to add a column that counts each ID's wins in the past month, so the result would look like this:

ID  DATE       WIN  NumOfDaysSinceLastWin NumOfWinsInThePast30days
A   2015/6/5   Yes           0               0       
A   2015/6/7   Yes           2               1 
A   2015/6/7   Yes           2               1 or (A 2015/6/7 Yes 0 2)
A   2015/6/8   No            1               3 
B   2015/8/7   No            0               0
B   2015/8/7   Yes           0               0
C   2015/5/15  Yes           0               0
C   2015/5/30  No            15              1
C   2015/7/30  No            76              0
C   2015/8/03  Yes           80              0

How can I get this using the groupby function and TimeGrouper?

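【Comments】:

  • For A's third win, shouldn't NumOfDaysSinceLastWin be 0? It falls on the same date as the previous win. Either way, shouldn't the column be titled NumOfDaysSinceFirstWin?
  • @dawg There are two interpretations. If both games happened at the same time on the same day, the number of days since the last win would be 2. If the second win happened an hour before the third win, the number of days since the last win for the third row would be 0, which is the result I put in parentheses. Both interpretations are valid. It should be the number of days since the last win, not the first win.

Tags: python pandas group-by date-difference rolling-sum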


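【Solution 1】:

The input data must be sorted by DATE within each group; in this data it already is.
The sample input does not cover every case well, so 4 more rows are added.

WIN1 is created from WIN: the value 1 corresponds to 'Yes' and 0 to 'No'. It is needed for both output columns.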

df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
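
As a small sketch of the same mapping without a lambda (the three-row frame `sample` is made up for illustration and is not the question's data), a boolean comparison cast to int gives the same 0/1 column:

```python
import pandas as pd

# Hypothetical three-row sample, not the question's data
sample = pd.DataFrame({'WIN': ['Yes', 'No', 'Yes']})

# The comparison yields booleans; astype(int) turns True/False into 1/0
sample['WIN1'] = (sample['WIN'] == 'Yes').astype(int)
print(sample['WIN1'].tolist())
```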

NumOfDaysSinceLastWin

First create the column cumsum (cumulative sum).

df['cumsum'] = df['WIN1'].cumsum()

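If every WIN were 'Yes', it would be easy: group the data and store the difference between each date and the previous date (shifted by one row) in the column diffs.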

#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0)) 

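But the 'No' values in the WIN column complicate things. A 'Yes' row needs the difference from the previous 'Yes', and a 'No' row needs the difference from the last preceding 'Yes'. The difference can be computed in several ways; here it is done by subtracting two columns, DATE and date1.

Column date1
The rows must be grouped in a special way: each 'Yes' together with the 'No' rows that follow it. The cumulative sum in column cumsum makes this possible. The minimum DATE of each such group is the date of its 'Yes' row, and that value is repeated on the rows with 'No'. The column count is special: it is 1 where the cumsum value is not repeated, and for repeated values it is the size of the group.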

df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
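
To see why the cumulative sum works as a group label, here is a minimal sketch on a made-up four-row frame `s` (one ID, a win, two losses, another win). It only illustrates the grouping idea and assumes the rows are already sorted by DATE:

```python
import pandas as pd

# Hypothetical sample for one ID: a win, two losses, then another win
s = pd.DataFrame({
    'ID': ['C', 'C', 'C', 'C'],
    'DATE': pd.to_datetime(['2015-05-15', '2015-05-30', '2015-07-30', '2015-08-03']),
    'WIN1': [1, 0, 0, 1],
})

# Every win increments the running total, so a win and the
# losses that follow it end up with the same label
s['cumsum'] = s['WIN1'].cumsum()

# The earliest date in each (ID, cumsum) group is the date of the
# win that opened the group; transform repeats it on every row
s['min'] = s.groupby(['ID', 'cumsum'])['DATE'].transform('min')
print(s)
```

The two loss rows receive the date of the first win, and the second win starts a new group with its own date.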

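The dates of the preceding 'Yes' rows are needed for the difference. DataFrame df1 keeps only the 'Yes' rows of df and is then grouped by column ID. The index is unchanged, so the output can be assigned to a new column of df.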

df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift()) 
print(df)

   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1        NaT
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4        NaT
6   B 2015-06-07   No     0       5 2015-06-07      4        NaT
7   B 2015-06-07   No     0       5 2015-06-07      4        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07      1        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15      3        NaT
10  C 2015-05-30   No     0       7 2015-05-15      3        NaT
11  C 2015-07-30   No     0       7 2015-05-15      3        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03
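
The assignment back to df relies on index alignment. A minimal sketch of that step, assuming a made-up frame `s` and using `groupby(...).shift()` directly instead of `apply` with a lambda:

```python
import pandas as pd

# Hypothetical sample: two wins and a loss for A, one win for B
s = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-08', '2015-08-07']),
    'WIN': ['Yes', 'Yes', 'No', 'Yes'],
})

# Keep only the winning rows; their original index labels are preserved
wins = s[s['WIN'] == 'Yes']

# shift() inside each ID moves every win date down to the next win.
# Assigning to s aligns on the index, so rows missing from wins get NaT.
s['date1'] = wins.groupby('ID')['DATE'].shift()
print(s)
```

Only the second win of A gets a previous win date; the first win of each ID and the loss row stay NaT.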

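Then the date column min (for 'No' rows and the last preceding 'Yes') and column date1 (for the other 'Yes' rows) can be combined using column count.
A new condition is added: date1 must be empty (NaT), because only those values are overwritten with column min.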

df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print(df)

   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4 2015-06-07
6   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
7   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
8   B 2015-08-07  Yes     1       6 2015-08-07      1 2015-08-07
9   C 2015-05-15  Yes     1       7 2015-05-15      3 2015-05-15
10  C 2015-05-30   No     0       7 2015-05-15      3 2015-05-15
11  C 2015-07-30   No     0       7 2015-05-15      3 2015-05-15
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03

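Repeated datetimes: sub-solution
Sorry if this is complicated; maybe someone will find a better way.
My solution is to find the repeated values, fill them with the last preceding 'Yes', and write that into column date1 for the difference.
These rows are identified in column count. The other values (1) are reset to NaN. Then the values from date1 are copied to date2 for the rows flagged in count.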

df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print(df)

   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-07 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-07 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-08-03 2015-08-03 

Then these values are replaced by the group minimum and written to column date1.

def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print(df)

   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-05 2015-06-05
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-05 2015-06-05
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-05-15 2015-05-15 
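
The group-minimum broadcast can also be sketched with `transform('min')` in place of a helper function. The frame `s` below is made up (three wins by A on the same day) and only illustrates that one step:

```python
import pandas as pd

# Hypothetical sample: three wins by A on the same day, each
# currently pointing at a different "previous win" date
s = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'DATE': pd.to_datetime(['2015-06-07'] * 3),
    'date2': pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-07']),
})

# transform('min') computes the minimum per (ID, DATE) group and
# repeats it on every row of that group
s['date2'] = s.groupby(['ID', 'DATE'])['date2'].transform('min')
print(s)
```

All three rows end up with 2015-06-05, which corresponds to the interpretation where same-day wins count as simultaneous.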

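Column NumOfDaysSinceLastWin is filled with the difference between DATE and date1. Its dtype is Timedelta, so it is converted to an integer number of days. Finally the unnecessary columns are dropped. (Only columns WIN1 and count are needed for the next output column, so they are kept.)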

df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum','min', 'date1', 'date2'], axis=1 )
print(df)
   ID       DATE  WIN  WIN1  count  NumOfDaysSinceLastWin
0   A 2015-06-05  Yes     1      2                      0
1   A 2015-06-05  Yes     1      2                      0
2   A 2015-06-07  Yes     1      3                      2
3   A 2015-06-07  Yes     1      3                      2
4   A 2015-06-07  Yes     1      3                      2
5   A 2015-06-08   No     0    NaN                      1
6   B 2015-06-07   No     0    NaN                      0
7   B 2015-06-07   No     0    NaN                      0
8   B 2015-08-07  Yes     1    NaN                      0
9   C 2015-05-15  Yes     1    NaN                      0
10  C 2015-05-30   No     0    NaN                     15
11  C 2015-07-30   No     0    NaN                     76
12  C 2015-08-03  Yes     1      2                     80
13  C 2015-08-03  Yes     1      2                     80
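
For comparison, here is a shorter sketch of a different technique (not the one used in this answer): mask the date on losing rows, shift within each ID, and forward-fill. It assumes the rows are sorted by DATE within each ID and follows the interpretation from the question where a later row on the same day sees the earlier same-day win, so the parenthesized variant (0 days). The frame `s` reuses the A and C rows from the question's original data:

```python
import pandas as pd

# Rows for A and C taken from the question's original data
s = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-07', '2015-06-08',
                            '2015-05-15', '2015-05-30', '2015-07-30', '2015-08-03']),
    'WIN':  ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
})

# Keep the date only on winning rows, NaT elsewhere
win_date = s['DATE'].where(s['WIN'] == 'Yes')

# Shift by one row inside each ID so a row never counts as its own
# "last win", then carry the most recent win date forward inside each ID
prev_win = win_date.groupby(s['ID']).shift()
prev_win = prev_win.groupby(s['ID']).ffill()

# Rows with no earlier win get 0, as in the question's expected output
s['NumOfDaysSinceLastWin'] = (s['DATE'] - prev_win).dt.days.fillna(0).astype(int)
print(s)
```

With this variant the last row of C gets 80 (days since 2015-05-15) rather than 0, because the current win is excluded.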

NumOfWinsInThePast30days

A rolling sum is your friend. Column yes (needed for resampling) maps 'Yes' to 1 and 'No' to NaN.

df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)

DataFrame df2 is a copy of df with column DATE set as the index (required for resampling). Unnecessary columns are dropped.

df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)

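Then df2 is resampled by day, counting the 'Yes' rows on each day: a 'Yes' row contributes 1 and a 'No' row contributes 0. (This is easier to see in the printed df2 below.)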

df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
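
Newer pandas releases spell the aggregation as a method call on the resampler rather than as a `how=` argument. A minimal sketch of the daily count for a single ID, on a made-up two-row frame `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for one ID: a win, then a loss two days later
s = pd.DataFrame({
    'DATE': pd.to_datetime(['2015-05-15', '2015-05-17']),
    'yes':  [1, np.nan],
}).set_index('DATE')

# count() ignores NaN, so each calendar day gets the number of
# 'Yes' rows on that day; days without any row get 0
daily = s['yes'].resample('D').count()
print(daily)
```

The result has one row per calendar day between the first and last date, which is why the full df2 listing grows so long.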

DataFrame df2 is grouped by ID and the function rolling_sum is applied to these groups.

df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
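
As a sketch of an alternative (an assumption on my part, not what this answer does), a time-based window such as `rolling('30D')` can be applied directly to the original rows, which avoids creating one row per calendar day. The frame `s` is made up, is sorted by ID and then DATE, and has no repeated dates; rows sharing a date would still need a decision on which interpretation from the question to use:

```python
import pandas as pd

# Hypothetical sample, already sorted by ID and then DATE
s = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-20',
                            '2015-07-30', '2015-08-03']),
    'WIN1': [1, 1, 0, 0, 1],
})

# Sum WIN1 over the 30 days ending at each row, separately per ID.
# The window is (DATE - 30 days, DATE], so it includes the current row.
rolled = (s.set_index('DATE')
           .groupby('ID')['WIN1']
           .rolling('30D')
           .sum())

# rolled comes back in (ID, DATE) order, which matches the sorted frame,
# so the values are assigned by position. Subtract the row's own win.
s['NumOfWinsInThePast30days'] = (rolled.to_numpy() - s['WIN1'].to_numpy()).astype(int)
print(s)
```

Because no daily rows are materialized, the intermediate result has exactly as many rows as the input, which may matter when there are many IDs.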

For better understanding, all rows of df2 are displayed.

with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print(df2)

    ID       DATE  yes  rollsum
0    A 2015-06-05    2        2
1    A 2015-06-06    0        2
2    A 2015-06-07    3        5
3    A 2015-06-08    0        5
4    B 2015-06-07    0        0
5    B 2015-06-08    0        0
6    B 2015-06-09    0        0
7    B 2015-06-10    0        0
8    B 2015-06-11    0        0
9    B 2015-06-12    0        0
10   B 2015-06-13    0        0
11   B 2015-06-14    0        0
12   B 2015-06-15    0        0
13   B 2015-06-16    0        0
14   B 2015-06-17    0        0
15   B 2015-06-18    0        0
16   B 2015-06-19    0        0
17   B 2015-06-20    0        0
18   B 2015-06-21    0        0
19   B 2015-06-22    0        0
20   B 2015-06-23    0        0
21   B 2015-06-24    0        0
22   B 2015-06-25    0        0
23   B 2015-06-26    0        0
24   B 2015-06-27    0        0
25   B 2015-06-28    0        0
26   B 2015-06-29    0        0
27   B 2015-06-30    0        0
28   B 2015-07-01    0        0
29   B 2015-07-02    0        0
30   B 2015-07-03    0        0
31   B 2015-07-04    0        0
32   B 2015-07-05    0        0
33   B 2015-07-06    0        0
34   B 2015-07-07    0        0
35   B 2015-07-08    0        0
36   B 2015-07-09    0        0
37   B 2015-07-10    0        0
38   B 2015-07-11    0        0
39   B 2015-07-12    0        0
40   B 2015-07-13    0        0
41   B 2015-07-14    0        0
42   B 2015-07-15    0        0
43   B 2015-07-16    0        0
44   B 2015-07-17    0        0
45   B 2015-07-18    0        0
46   B 2015-07-19    0        0
47   B 2015-07-20    0        0
48   B 2015-07-21    0        0
49   B 2015-07-22    0        0
50   B 2015-07-23    0        0
51   B 2015-07-24    0        0
52   B 2015-07-25    0        0
53   B 2015-07-26    0        0
54   B 2015-07-27    0        0
55   B 2015-07-28    0        0
56   B 2015-07-29    0        0
57   B 2015-07-30    0        0
58   B 2015-07-31    0        0
59   B 2015-08-01    0        0
60   B 2015-08-02    0        0
61   B 2015-08-03    0        0
62   B 2015-08-04    0        0
63   B 2015-08-05    0        0
64   B 2015-08-06    0        0
65   B 2015-08-07    1        1
66   C 2015-05-15    1        1
67   C 2015-05-16    0        1
68   C 2015-05-17    0        1
69   C 2015-05-18    0        1
70   C 2015-05-19    0        1
71   C 2015-05-20    0        1
72   C 2015-05-21    0        1
73   C 2015-05-22    0        1
74   C 2015-05-23    0        1
75   C 2015-05-24    0        1
76   C 2015-05-25    0        1
77   C 2015-05-26    0        1
78   C 2015-05-27    0        1
79   C 2015-05-28    0        1
80   C 2015-05-29    0        1
81   C 2015-05-30    0        1
82   C 2015-05-31    0        1
83   C 2015-06-01    0        1
84   C 2015-06-02    0        1
85   C 2015-06-03    0        1
86   C 2015-06-04    0        1
87   C 2015-06-05    0        1
88   C 2015-06-06    0        1
89   C 2015-06-07    0        1
90   C 2015-06-08    0        1
91   C 2015-06-09    0        1
92   C 2015-06-10    0        1
93   C 2015-06-11    0        1
94   C 2015-06-12    0        1
95   C 2015-06-13    0        1
96   C 2015-06-14    0        0
97   C 2015-06-15    0        0
98   C 2015-06-16    0        0
99   C 2015-06-17    0        0
100  C 2015-06-18    0        0
101  C 2015-06-19    0        0
102  C 2015-06-20    0        0
103  C 2015-06-21    0        0
104  C 2015-06-22    0        0
105  C 2015-06-23    0        0
106  C 2015-06-24    0        0
107  C 2015-06-25    0        0
108  C 2015-06-26    0        0
109  C 2015-06-27    0        0
110  C 2015-06-28    0        0
111  C 2015-06-29    0        0
112  C 2015-06-30    0        0
113  C 2015-07-01    0        0
114  C 2015-07-02    0        0
115  C 2015-07-03    0        0
116  C 2015-07-04    0        0
117  C 2015-07-05    0        0
118  C 2015-07-06    0        0
119  C 2015-07-07    0        0
120  C 2015-07-08    0        0
121  C 2015-07-09    0        0
122  C 2015-07-10    0        0
123  C 2015-07-11    0        0
124  C 2015-07-12    0        0
125  C 2015-07-13    0        0
126  C 2015-07-14    0        0
127  C 2015-07-15    0        0
128  C 2015-07-16    0        0
129  C 2015-07-17    0        0
130  C 2015-07-18    0        0
131  C 2015-07-19    0        0
132  C 2015-07-20    0        0
133  C 2015-07-21    0        0
134  C 2015-07-22    0        0
135  C 2015-07-23    0        0
136  C 2015-07-24    0        0
137  C 2015-07-25    0        0
138  C 2015-07-26    0        0
139  C 2015-07-27    0        0
140  C 2015-07-28    0        0
141  C 2015-07-29    0        0
142  C 2015-07-30    0        0
143  C 2015-07-31    0        0
144  C 2015-08-01    0        0
145  C 2015-08-02    0        0
146  C 2015-08-03    2        2

The unnecessary column yes is dropped.

df2 = df2.drop(['yes'], axis=1 )

The output is merged with the first DataFrame df.

df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print(df2)

  ID       DATE  WIN  WIN1  NumOfDaysSinceLastWin  yes  rollsum
0  A 2015-06-07  Yes     1                      0    1        2
1  A 2015-06-07  Yes     1                      0    1        2
2  B 2015-08-07   No     0                      0  NaN        1
3  B 2015-08-07  Yes     1                      0    1        1
4  C 2015-05-15  Yes     1                      0    1        1
5  C 2015-05-30   No     0                     15  NaN        1
6  C 2015-07-30   No     0                     76  NaN        0
7  C 2015-08-03  Yes     1                     80    1        1
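
A minimal sketch of what the inner merge does here, using two made-up frames (`rows` stands in for the original rows, `daily` for the resampled per-day table): days that have no original row are dropped, and repeated dates each pick up the same daily value.

```python
import pandas as pd

# Hypothetical original rows: A plays once on 06-05 and twice on 06-07
rows = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-07', '2015-06-07']),
    'WIN': ['Yes', 'Yes', 'Yes'],
})

# Hypothetical per-day table with one row per calendar day
daily = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-06', '2015-06-07']),
    'rollsum': [1, 1, 3],
})

# Inner merge keeps only (DATE, ID) pairs present in both frames
merged = pd.merge(rows, daily, on=['DATE', 'ID'], how='inner')
print(merged)
```

The filler day 2015-06-06 disappears, and both rows dated 2015-06-07 receive the value 3.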

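If the values in column count are not null, they are written to column WIN1. The rolling_sum result includes the 'Yes' rows of the current date in the original df, so they have to be subtracted. That value (1, or the count for repeated dates) is in column WIN1.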

df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']

Drop the unnecessary columns.

df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print(df2)
   ID       DATE  WIN  NumOfDaysSinceLastWin  NumOfWinsInThePast30days
0   A 2015-06-05  Yes                      0                         0
1   A 2015-06-05  Yes                      0                         0
2   A 2015-06-07  Yes                      2                         2
3   A 2015-06-07  Yes                      2                         2
4   A 2015-06-07  Yes                      2                         2
5   A 2015-06-08   No                      1                         5
6   B 2015-06-07   No                      0                         0
7   B 2015-06-07   No                      0                         0
8   B 2015-08-07  Yes                      0                         0
9   C 2015-05-15  Yes                      0                         0
10  C 2015-05-30   No                     15                         1
11  C 2015-07-30   No                     76                         0
12  C 2015-08-03  Yes                     80                         0
13  C 2015-08-03  Yes                     80                         0

All together:

import pandas as pd
import numpy as np
import io

#original data
temp=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes"""

#changed repeating data
temp2=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes
C,2015/8/03,Yes"""

df = pd.read_csv(io.StringIO(temp2), parse_dates = [1])

df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
df['cumsum'] = df['WIN1'].cumsum()

#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0)) 

df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')

df1 = df[~df['WIN'].isin(['No'])]

df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift()) 
print(df)
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print(df)

#resolve repeating datetimes
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print(df)

def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print(df)

df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).astype('timedelta64[D]')
df = df.drop(['cumsum','min','date1', 'date2'], axis=1 )
print(df)

#NumOfWinsInThePast30days
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)

df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()

df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)

#with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    #print(df2)

df2 = df2.drop(['yes'], axis=1 )
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print(df2)

df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']

df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print(df2)

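【Comments】:

  • Thanks! But something is odd. NumOfDaysSinceLastWin returns 0 whenever the current result is 'Yes'. It shouldn't take the current win into account. For example, NumOfDaysSinceLastWin at index 7 should be 79, not 0.
  • I think df['min'] = df.groupby(['cumsum'])['DATE'].transform('min') should be df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min'), right?
  • And df2['rollsum'] might blow up my memory, because I have a lot of IDs (500,000+).
  • Oops, sorry, it should be 80, not 79.
  • Also, if I change the first 4 rows of the input as shown above, your result seems incorrect.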