Answers to: Row Count in Group By if Date is Before the Current Row being Evaluated

【Question title】: Row Count in Group By if Date is Before the Current Row being Evaluated
【Posted】: 2021-07-15 00:53:37
【Question】:

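I want to compute num_opens_at_campaign_send per customer_id. This is the number of campaigns the customer had already opened at the time each campaign was sent.

I can't figure out the best way to do this in pandas, so any help would be appreciated. I was thinking of using a groupby on customer_id with an apply function that compares each campaign_sent date against all the other dates in that column, but I'm not sure of the exact way to get the row count, i.e. the number of campaigns the customer had opened each time a campaign was sent.

The dataframe looks like this: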

customer_id	campaign_id	campaign_sent	opened
a	1234	2021-01-01	True
b	1234	2021-01-01	True
c	1234	2021-01-01	False
a	2222	2021-02-01	True
b	2222	2021-02-01	False
c	2222	2021-02-01	True
a	3333	2021-03-01	True
b	3333	2021-03-01	False
c	3333	2021-03-01	True

The desired output is:

customer_id	campaign_id	campaign_sent	num_opens_at_campaign_send
a	1234	2021-01-01	0
b	1234	2021-01-01	0
c	1234	2021-01-01	0
a	2222	2021-02-01	1
b	2222	2021-02-01	1
c	2222	2021-02-01	0
a	3333	2021-03-01	2
b	3333	2021-03-01	1
c	3333	2021-03-01	1

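So for the first campaign, num_opens_at_campaign_send is 0 for every customer, because there were no earlier campaigns.

For example, customer_id 'b' had opened 1 email when campaign_id 3333 was sent, because they opened the first campaign (1234) but did not open the second campaign (2222) email.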
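
To make the definition concrete, here is a minimal brute-force sketch of the "compare each campaign_sent date against the others" idea described above. It rebuilds the sample data and, for each row, counts that customer's opened campaigns with a strictly earlier send date. The helper name opens_before is hypothetical, and this is meant as a reference for checking results, not as an efficient approach.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "customer_id": list("abc") * 3,
    "campaign_id": [1234] * 3 + [2222] * 3 + [3333] * 3,
    "campaign_sent": pd.to_datetime(
        ["2021-01-01"] * 3 + ["2021-02-01"] * 3 + ["2021-03-01"] * 3
    ),
    "opened": [True, True, False, True, False, True, True, False, True],
})


def opens_before(group):
    """Hypothetical helper: for each row of one customer's group, count the
    opened campaigns whose send date is strictly earlier than that row's."""
    counts = [
        int((group["opened"] & (group["campaign_sent"] < sent)).sum())
        for sent in group["campaign_sent"]
    ]
    return pd.Series(counts, index=group.index)


# Compute per customer, then align back onto df by the original index.
pieces = [opens_before(group) for _, group in df.groupby("customer_id")]
df["num_opens_at_campaign_send"] = pd.concat(pieces)

print(df.drop(columns="opened"))
```

This is quadratic in the number of rows per customer, so it is mainly useful for validating a faster solution on a small sample.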

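【Question comments】:

Tags: python pandas dataframe

【Solution 1】:

You can use groupby with cumsum to compute the num_opens_at_campaign_send column. Group the rows by customer_id; within each group, num_opens_at_campaign_send for a row is the cumulative sum of opened up to and including that row, minus the value of opened in that row.

To make sure the cumulative sum is computed in the correct date order, first sort the dataframe by the campaign_sent column.

Use: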

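# Sort by send date so the running total follows chronological order.
df = df.sort_values(by='campaign_sent')
# Running count of opens per customer (inclusive), minus the current row's own open.
opened = df['opened'].astype(int)
df['num_opens_at_campaign_send'] = opened.groupby(df['customer_id']).cumsum() - opened
df = df.drop(columns='opened')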

Output:

>>> df
  customer_id  campaign_id campaign_sent  num_opens_at_campaign_send
0           a         1234    2021-01-01                           0
1           b         1234    2021-01-01                           0
2           c         1234    2021-01-01                           0
3           a         2222    2021-02-01                           1
4           b         2222    2021-02-01                           1
5           c         2222    2021-02-01                           0
6           a         3333    2021-03-01                           2
7           b         3333    2021-03-01                           1
8           c         3333    2021-03-01                           1
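
For anyone who wants to run this end to end, here is a self-contained sketch of the same sort-then-cumsum idea on the sample data from the question. It assumes at most one row per customer per send date; if a customer could receive two campaigns on the same date, the earlier row in the sort order would count toward the later one. The stable sort is my addition, used only to keep rows with equal dates in their original order.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "customer_id": list("abc") * 3,
    "campaign_id": [1234] * 3 + [2222] * 3 + [3333] * 3,
    "campaign_sent": pd.to_datetime(
        ["2021-01-01"] * 3 + ["2021-02-01"] * 3 + ["2021-03-01"] * 3
    ),
    "opened": [True, True, False, True, False, True, True, False, True],
})

# Chronological order; a stable sort keeps ties in their original order.
df = df.sort_values("campaign_sent", kind="stable")

# Cast to int so the subtraction below is plain integer arithmetic.
opened = df["opened"].astype(int)

# Inclusive running total of opens per customer.
running = opened.groupby(df["customer_id"]).cumsum()

# Subtract the current row's own open to count only earlier campaigns.
df["num_opens_at_campaign_send"] = running - opened

result = df.drop(columns="opened")
print(result)
```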

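【Comments】:

Does df need to be sorted in ascending order on the campaign_sent column for this to work?
Thanks for your help. If you wanted num_opens_at_campaign_send per customer_id over only the last n days, how would that be done? For example, the last 30 days? Some form of rolling function? Again, I'm struggling to find the right approach.
I think resampling would be needed, since the data does have daily records
This might help: Time aware rolling in pandas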
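
As a rough sketch of the time-aware rolling idea for the "last 30 days" follow-up: within each customer, index the opens by campaign_sent and use an offset window with closed="left", so the window covers the 30 days before each send and excludes the send itself. The helper name opens_in_window and the column name num_opens_last_30d are hypothetical, and this is one possible approach rather than a confirmed answer from the thread.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "customer_id": list("abc") * 3,
    "campaign_id": [1234] * 3 + [2222] * 3 + [3333] * 3,
    "campaign_sent": pd.to_datetime(
        ["2021-01-01"] * 3 + ["2021-02-01"] * 3 + ["2021-03-01"] * 3
    ),
    "opened": [True, True, False, True, False, True, True, False, True],
})

# Offset-based rolling needs a monotonic datetime index, so sort first.
df = df.sort_values("campaign_sent", kind="stable")


def opens_in_window(group, window="30D"):
    """Hypothetical helper: opens by one customer in the `window` before
    each send date, excluding the campaign being sent."""
    opens = pd.Series(
        group["opened"].astype(int).to_numpy(),
        index=pd.DatetimeIndex(group["campaign_sent"]),
    )
    # closed="left" makes the window [sent - window, sent): the current
    # row is excluded. An empty window yields NaN, which becomes 0.
    rolled = opens.rolling(window, closed="left").sum().fillna(0).astype(int)
    # Restore the original row labels so the result aligns with df.
    return pd.Series(rolled.to_numpy(), index=group.index)


pieces = [opens_in_window(group) for _, group in df.groupby("customer_id")]
df["num_opens_last_30d"] = pd.concat(pieces)

print(df.drop(columns="opened"))
```

With the sample data and a 30-day window, only the 2021-03-01 rows can be non-zero: 2021-02-01 is 28 days before that send, while 2021-01-01 is 31 days before the 2021-02-01 send and so falls outside its window.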