Pandas dataframe rolling difference in value for 5 second intervals per group

【Question title】: Pandas dataframe rolling difference in value for 5 second intervals per group
【Posted】: 2019-03-30 05:55:11
【Question】:

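I have a Pandas dataframe with timestamps (unevenly spaced), a sequence number, a category, and a percentage formation. The sequence number is only used to order rows that share the same timestamp and category, and it is dropped after sorting.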

|----------------------------------------------------------------|
|                        | seq_no   | category   | pct_formation |
|----------------------------------------------------------------|
|ts_timestamp            |          |            |               |
|----------------------------------------------------------------|
|2018-10-22 10:13:44.043 | 6839262  | in_petr    | 37.070000     |
|2018-10-22 10:17:09.527 | 7257908  | in_petr    | 36.970000     |
|2018-10-22 10:17:43.977 | 7319000  | in_dsh     | 36.950000     |
|2018-10-22 10:17:43.963 | 7318885  | in_dsh     | 36.960000     |
|2018-10-22 10:17:09.527 | 7257918  | in_petr    | 32.960000     |
|2018-10-22 10:19:44.040 | 7585354  | out_petr   | 36.890000     |
|2018-10-22 10:19:44.043 | 7585461  | out_petr   | 36.900000     |
|2018-10-22 10:19:37.267 | 7563817  | sync       | 33.910000     |
|2018-10-22 10:19:44.057 | 7586045  | sync       | 36.960000     |
|2018-10-22 10:19:16.750 | 7516841  | out_petr   | 36.880000     |
|2018-10-22 10:20:03.160 | 7637889  | sync       | 36.980000     |
|2018-10-22 10:20:32.350 | 7691592  | sync       | 37.000000     |
|2018-10-22 10:23:03.150 | 8008804  | sync       | 34.580000     |
|2018-10-22 10:22:18.633 | 7907782  | in_dsh     | 36.980000     |
|2018-10-22 10:25:39.557 | 8290932  | in_dsh     | 36.970000     |
|----------------------------------------------------------------|

I want to get the change in pct_formation for each category over every five seconds between 10:00:00 and 11:00:00 each day.

So far I have tried:

df.sort_index()[['category', 'pct_formation']] \
.groupby(['category', df.index.date]) \
.rolling('5s').pct_formation.mean()

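I group by date because I suspect that grouping on the timestamps as they are would not produce meaningful groups: the timestamps are unevenly spaced and very fine-grained.

How can I get evenly spaced 5 second windows between 10:00:00 and 11:00:00 (for example 10:00:00 to 10:00:05, 10:00:01 to 10:00:06, and so on)? And how can I get the difference in pct_formation between the start and end of each 5 second window?

If I use functions like min() and max() after rolling(), I get errors such as: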

ValueError: could not convert string to float: 'out_petr'
TypeError: cannot handle this type -> object

Please guide me on how to proceed; I would be grateful. TIA.
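
Both errors above are consistent with the non-numeric category column reaching the rolling aggregation. A minimal sketch of one way around that, assuming the frame is indexed by ts_timestamp as in the table: select pct_formation before calling rolling(), so only numbers are aggregated. The toy rows (timestamps moved closer together so that windows overlap) and the name window_range are illustrative and not from the original post.

```python
import pandas as pd

# Toy frame shaped like the question's data; values and timestamps are illustrative.
df = pd.DataFrame(
    {
        "category": ["in_petr", "in_petr", "sync", "sync", "sync"],
        "pct_formation": [37.07, 36.97, 33.91, 36.96, 36.98],
    },
    index=pd.to_datetime(
        [
            "2018-10-22 10:13:44.043",
            "2018-10-22 10:13:46.527",
            "2018-10-22 10:19:37.267",
            "2018-10-22 10:19:40.057",
            "2018-10-22 10:20:03.160",
        ]
    ).rename("ts_timestamp"),
)

# Select the numeric column before rolling so 'category' is used only as the group key.
roll = df.sort_index().groupby("category")["pct_formation"].rolling("5s")

# Spread (max minus min) of pct_formation inside each trailing 5-second window.
window_range = roll.max() - roll.min()
print(window_range)
```

The result is indexed by (category, ts_timestamp). It still has one row per original observation, so it does not by itself give an evenly spaced grid.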

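Edit: adding details based on feedback in the comments.

I want a rolling window, so the window after 10:00:00 to 10:00:05 is 10:00:01 to 10:00:06, then 10:00:02 to 10:00:07, and so on.

I want to see how much the pct_formation value changes from one window to the next, so if there are several values in the same interval I would use mean().

I think I have to use .resample() to get evenly spaced intervals between 10 AM and 11 AM each day, but I am finding it hard to understand.

I realise I can create regularly spaced time windows, for example: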

pd.date_range(start=df.index.min().replace(hour=10, minute=0, second=0, microsecond=0),
              end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
              freq='5s')

However, I don't know how to reshape my dataframe so that each category conforms to these times.
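
One possible way to line each category up with such a grid, sketched under assumptions: the frame is indexed by ts_timestamp, only a single day is covered, and values that fall into the same 5 second bin are averaged. The helper to_grid and the result name wide are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Toy frame with a few rows from the question's table.
df = pd.DataFrame(
    {
        "category": ["in_petr", "in_petr", "in_petr", "in_dsh", "in_dsh"],
        "pct_formation": [37.07, 36.97, 32.96, 36.96, 36.95],
    },
    index=pd.to_datetime(
        [
            "2018-10-22 10:13:44.043",
            "2018-10-22 10:17:09.527",
            "2018-10-22 10:17:09.527",
            "2018-10-22 10:17:43.963",
            "2018-10-22 10:17:43.977",
        ]
    ).rename("ts_timestamp"),
)

# Regular 5-second grid from 10:00:00 to 11:00:00 (single day assumed).
grid = pd.date_range("2018-10-22 10:00:00", "2018-10-22 11:00:00", freq="5s")


def to_grid(group):
    """Average observations per 5-second bin, then place them on the full grid."""
    binned = group["pct_formation"].resample("5s").mean()
    return binned.reindex(grid)  # bins without data stay NaN


# One column per category, one row per grid point.
wide = pd.DataFrame({cat: to_grid(g) for cat, g in df.sort_index().groupby("category")})
print(wide.dropna(how="all"))
```

Bins with no observations stay NaN here. Whether to forward-fill, interpolate, or leave them empty depends on what a missing reading should mean for the analysis.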

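【Comments】:

Your question isn't very clear. Do you want 5 second windows like 10:00:05 to 10:00:10 to 10:00:15 and so on? If so, how does 10:00:01 to 10:00:06 fit in? Those are two different sets of 5 second intervals. What would you do if you have two times in the same window and category? Average them? What if there is no data for a category in a 5 second window?
@flyingmeatball I want a rolling window, so the window after 10:00:00 to 10:00:05 is 10:00:01 to 10:00:06. I want to see how much the pct_formation value changes from one window to the next, so an average would be fine. I'll edit the details into the question. Thanks for pointing that out!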

Tags: python pandas dataframe time-series window-functions

【Solution 1】:

IIUC, you can use resample() and rolling():

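import pandas as pd

# Assumes ts_timestamp is a column (df.reset_index() if it is the index); no fixed format, since the values carry milliseconds.
df['ts_timestamp'] = pd.to_datetime(df['ts_timestamp'])

# Per category: drop repeated timestamps, then forward-fill onto a 1-second grid.
resampled = df.groupby('category').apply(lambda x: x.drop_duplicates('ts_timestamp').set_index('ts_timestamp').resample('1s').ffill())

# First value minus last value over each run of 5 consecutive rows.
resampled['pct_formation'].rolling(5).apply(lambda x: x[0]-x[-1], raw=True)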

Yields (a short sample):

category  ts_timestamp       
in_dsh    2018-10-22 10:17:43    NaN
          2018-10-22 10:17:44    NaN
          2018-10-22 10:17:45    NaN
          2018-10-22 10:17:46    NaN
          2018-10-22 10:17:47    NaN
          2018-10-22 10:17:48    0.0
          2018-10-22 10:17:49    0.0
          2018-10-22 10:17:50    0.0
          2018-10-22 10:17:51    0.0
          2018-10-22 10:17:52    0.0
          2018-10-22 10:17:53    0.0
          2018-10-22 10:17:54    0.0
          2018-10-22 10:17:55    0.0
...

For now I simply used ffill() to fill in the rather sparse data, but you could also consider interpolation and so on.
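
To make the difference between the two filling strategies concrete, here is a small sketch on two made-up observations four seconds apart. It assumes time-weighted linear interpolation is an acceptable model for the gaps, which may not hold for this data.

```python
import pandas as pd

# Two illustrative observations, four seconds apart.
s = pd.Series(
    [36.0, 37.0],
    index=pd.to_datetime(["2018-10-22 10:00:00", "2018-10-22 10:00:04"]),
)

# One row per second; seconds without an observation become NaN.
per_second = s.resample("1s").mean()

# Forward fill holds the last seen value until the next observation (a step function).
filled_ffill = per_second.ffill()

# Time-based interpolation draws a straight line between observations.
filled_interp = per_second.interpolate(method="time")

print(pd.DataFrame({"ffill": filled_ffill, "interpolate": filled_interp}))
```

With forward fill, any window that lies entirely inside a gap shows a difference of 0, which is why the sample output above is mostly zeros. With interpolation the change is spread evenly across the gap instead.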

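【Comments】:

This works, but it doesn't give windows between 10 AM and 11 AM. It only gives windows where the category has timestamps. Is there any way to fix that?
Is there a reason you want 3600 rows per category that are mostly 0, given that in some cases they are informed by only 3 points? It seems like a strange approach, unless I'm missing something.
I want to measure and compare the average change in pct_formation over the same time period. What would be a better approach?
You should really include your desired output in your question.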
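
On the comment asking for windows across the whole 10 AM to 11 AM range: one option, sketched here under assumptions, is to reindex each category onto a full one-second grid before taking the difference. The sketch assumes a single day, a ts_timestamp index, forward filling, and "end minus start" over five consecutive one-second samples (the answer above computes start minus end, so the sign is flipped). The helper rolling_change is a hypothetical name.

```python
import pandas as pd

# Toy frame with a few rows from the question's table.
df = pd.DataFrame(
    {
        "category": ["out_petr", "sync", "out_petr", "sync", "sync"],
        "pct_formation": [36.88, 33.91, 36.89, 36.96, 36.98],
    },
    index=pd.to_datetime(
        [
            "2018-10-22 10:19:16.750",
            "2018-10-22 10:19:37.267",
            "2018-10-22 10:19:44.040",
            "2018-10-22 10:19:44.057",
            "2018-10-22 10:20:03.160",
        ]
    ).rename("ts_timestamp"),
)

# Full one-second grid for 10:00:00 to 11:00:00 on a single (assumed) day.
day = pd.Timestamp("2018-10-22")
grid = pd.date_range(day + pd.Timedelta(hours=10), day + pd.Timedelta(hours=11), freq="1s")


def rolling_change(group):
    """End-minus-start change over five consecutive one-second samples."""
    # Last observation in each second, labelled by the start of that second.
    per_second = group["pct_formation"].resample("1s").last()
    # Extend to the whole hour; carry the last known value forward.
    on_grid = per_second.reindex(grid).ffill()
    # Five samples span four steps, so compare each value with the one 4 rows earlier.
    return on_grid.diff(4)


changes = pd.DataFrame(
    {cat: rolling_change(g) for cat, g in df.sort_index().groupby("category")}
)
print(changes.loc["2018-10-22 10:19:40":"2018-10-22 10:19:50"])
```

Rows before a category's first observation stay NaN rather than 0, which keeps "no data yet" distinguishable from "no change". For several days, the same function could be applied per date with a grid built for each day.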