Efficiently aggregate a resampled collection of datetimes in pandas: answers

【Question title】: Efficiently aggregate a resampled collection of datetimes in pandas
【Posted】: 2019-06-25 22:25:24
【Question description】:

Given the following dataset as a pandas DataFrame df:

index(as DateTime object) |  Name        |  Amount    |  IncomeOutcome
---------------------------------------------------------------
2019-01-28                |  Customer1   |  200.0     |  Income
2019-01-31                |  Customer1   |  200.0     |  Income
2019-01-31                |  Customer2   |  100.0     |  Income
2019-01-28                |  Customer2   |  -100.0    |  Outcome
2019-01-31                |  Customer2   |  -100.0    |  Outcome
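
For anyone who wants to reproduce the table above, here is a minimal sketch of how the frame could be built. The index name "index" is an assumption, taken from the header of the table.

```python
import pandas as pd

# Sample data from the table above; the dates become a DatetimeIndex.
df = pd.DataFrame(
    {
        "Name": ["Customer1", "Customer1", "Customer2", "Customer2", "Customer2"],
        "Amount": [200.0, 200.0, 100.0, -100.0, -100.0],
        "IncomeOutcome": ["Income", "Income", "Income", "Outcome", "Outcome"],
    },
    index=pd.to_datetime(
        ["2019-01-28", "2019-01-31", "2019-01-31", "2019-01-28", "2019-01-31"]
    ),
)
df.index.name = "index"  # assumed name, matching the table header

print(df)
```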

We perform the following steps:

grouped = df.groupby(["Name", "IncomeOutcome"])
sampled_by_month = grouped.resample("M")
aggregated = sampled_by_month.agg({"MonthlyCount": "size", "Amount": "sum"})

The desired output should look like this:

Name       |  IncomeOutcome   |  Amount    |  MonthlyCount
------------------------------------------------------------
Customer1  |  Income          |  400.0     |  2
Customer2  |  Income          |  100.0     |  1
Customer2  |  Outcome         |  -200.0    |  2

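The last step performs very poorly, which may be related to Pandas Issue #20660. My first idea was to convert all datetime objects to int64, but that leaves me with the question of how to resample the converted data by month.

Any suggestions on this problem?

Thanks in advance.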

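【Question comments】:

Hi Ben, what is the final form of the data you want to end up with? Could you post a small example DataFrame of your desired output? Thanks.
Hey Ollie, I updated the description. I hope that makes it clearer what I want to achieve. The sampled_by_month variable should hold groups whose values are arrays of datetime objects, which seems to be slow.
Fixed a possible error in the input. Please confirm.
Yes, that looks correct. Thanks!

Tags: python pandas performance numpy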

【Solution 1】:

Perhaps we can optimize your solution by resampling only a single column ("Amount", the column of interest).

(df.groupby(["Name", "IncomeOutcome"])['Amount']
   .resample("M")
   .agg(['sum','size'])
   .rename({'sum':'Amount', 'size': 'MonthlyCount'}, axis=1)
   .reset_index(level=-1, drop=True)
   .reset_index())

        Name IncomeOutcome  Amount  MonthlyCount
0  Customer1        Income   400.0             2
1  Customer2        Income   100.0             1
2  Customer2       Outcome  -200.0             2

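If this is still too slow, then I think the problem may be that resample inside a groupby slows things down. You could try grouping by all three keys in a single groupby call. For the date resampling, try pd.Grouper.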

(df.groupby(['Name', 'IncomeOutcome', pd.Grouper(freq='M')])['Amount']
   .agg([ ('Amount', 'sum'), ('MonthlyCount', 'size')])
   .reset_index(level=-1, drop=True)
   .reset_index())

        Name IncomeOutcome  Amount  MonthlyCount
0  Customer1        Income   400.0             2
1  Customer2        Income   100.0             1
2  Customer2       Outcome  -200.0             2

Performance-wise, this should be faster.
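
As a variation on the single-groupby idea, here is a sketch that uses named aggregation (available in pandas 0.25 and later) and a monthly period column instead of pd.Grouper. The helper column "Month" is a name introduced here for illustration only; it is not part of the original data. This is one possible way to write it, not necessarily the fastest.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "Name": ["Customer1", "Customer1", "Customer2", "Customer2", "Customer2"],
        "Amount": [200.0, 200.0, 100.0, -100.0, -100.0],
        "IncomeOutcome": ["Income", "Income", "Income", "Outcome", "Outcome"],
    },
    index=pd.to_datetime(
        ["2019-01-28", "2019-01-31", "2019-01-31", "2019-01-28", "2019-01-31"]
    ),
)

# "Month" is a hypothetical helper column holding the calendar month of each row.
with_month = df.assign(Month=df.index.to_period("M"))

# One groupby over all three keys; named aggregation sets the output column names.
result = (
    with_month.groupby(["Name", "IncomeOutcome", "Month"])
    .agg(Amount=("Amount", "sum"), MonthlyCount=("Amount", "size"))
    .reset_index()
)

print(result)
```

Unlike the snippets above, this keeps the month in the result, which is useful once the data spans more than one month.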

Performance

Let's set up a larger, more general DataFrame to test with.

# Setup
df_ = df.copy()
df1 = pd.concat([df_.reset_index()] * 100, ignore_index=True)
df = pd.concat([
        df1.replace({'Customer1': f'Customer{i}', 'Customer2': f'Customer{i+1}'}) 
        for i in range(1, 98, 2)], ignore_index=True) 
df = df.set_index('index')

df.shape
# (24500, 3)

%%timeit 
(df.groupby(["Name", "IncomeOutcome"])['Amount']
   .resample("M")
   .agg(['sum','size'])
   .rename({'sum':'Amount', 'size': 'MonthlyCount'}, axis=1)
   .reset_index(level=-1, drop=True)
   .reset_index())

%%timeit
(df.groupby(['Name', 'IncomeOutcome', pd.Grouper(freq='M')])['Amount']
   .agg([ ('Amount', 'sum'), ('MonthlyCount', 'size')])
   .reset_index(level=-1, drop=True)
   .reset_index())

1.71 s ± 85.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
24.2 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
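
The %%timeit cells above only work in IPython or Jupyter. Below is a rough sketch of the same comparison as a plain script using the standard timeit module. The function names are invented for this example, and pd.offsets.MonthEnd() is used in place of the "M" string because newer pandas releases deprecate that alias. Timings will vary by machine and pandas version, so no numbers are shown.

```python
import timeit

import pandas as pd

# The five sample rows, with the dates as an ordinary column named "index".
base = pd.DataFrame(
    {
        "index": pd.to_datetime(
            ["2019-01-28", "2019-01-31", "2019-01-31", "2019-01-28", "2019-01-31"]
        ),
        "Name": ["Customer1", "Customer1", "Customer2", "Customer2", "Customer2"],
        "Amount": [200.0, 200.0, 100.0, -100.0, -100.0],
        "IncomeOutcome": ["Income", "Income", "Income", "Outcome", "Outcome"],
    }
)


def build_benchmark_frame(copies=100):
    """Repeat the sample rows and relabel customers, as in the setup above."""
    repeated = pd.concat([base] * copies, ignore_index=True)
    frames = [
        repeated.assign(
            Name=repeated["Name"].map(
                {"Customer1": f"Customer{i}", "Customer2": f"Customer{i + 1}"}
            )
        )
        for i in range(1, 98, 2)
    ]
    return pd.concat(frames, ignore_index=True).set_index("index")


month_end = pd.offsets.MonthEnd()
bench = build_benchmark_frame()


def via_resample():
    # groupby followed by a per-group resample
    return (
        bench.groupby(["Name", "IncomeOutcome"])["Amount"]
        .resample(month_end)
        .agg(["sum", "size"])
    )


def via_grouper():
    # a single groupby with the month as a third key
    return (
        bench.groupby(["Name", "IncomeOutcome", pd.Grouper(freq=month_end)])["Amount"]
        .agg(["sum", "size"])
    )


t_resample = timeit.timeit(via_resample, number=1)
t_grouper = timeit.timeit(via_grouper, number=1)
print(f"groupby + resample: {t_resample:.4f} s")
print(f"groupby + Grouper:  {t_grouper:.4f} s")
```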

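【Comments】:

I tested it with general data two days ago and got a different output for your second snippet.
@jezrael Ah, yes, I think there was a mistake in their question. I fixed it and pinged them to check that it is correct.
@coldspeed Thanks for your answer! I will check it tomorrow and get back to you.
@Ben Let me know if anything comes up. Thanks.
A clarification for future readers: the second approach, using pd.Grouper, is much faster than the first, and it produces a DataFrame without NaN rows. For example, if you resample daily data, missing days will not show up as NaN rows. Only the traditional resample() produces rows for missing days, weeks, and so on. I believe that is why the total row count differs between the two approaches.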
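
To illustrate the last comment, here is a small sketch with made-up data (one customer with rows in January and March, nothing in February). It assumes monthly bins built with pd.offsets.MonthEnd(). Note that with sum the empty bin shows up as 0.0 rather than NaN; an aggregation such as mean would give NaN there.

```python
import pandas as pd

# Hypothetical data with a gap: no rows in February.
df = pd.DataFrame(
    {"Name": ["Customer1", "Customer1"], "Amount": [200.0, 50.0]},
    index=pd.to_datetime(["2019-01-28", "2019-03-15"]),
)
month_end = pd.offsets.MonthEnd()

# resample fills every month between the first and last date of each group.
via_resample = df.groupby("Name")["Amount"].resample(month_end).sum()

# A multi-key groupby with pd.Grouper keeps only the combinations that occur.
via_grouper = df.groupby(["Name", pd.Grouper(freq=month_end)])["Amount"].sum()

print(via_resample)  # three rows, including an empty February bin
print(via_grouper)   # two rows, January and March only
```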