Calculate the actual duration of a pandas rolling offset window

【Question title】: Calculate the actual duration of a pandas rolling offset window
【Posted】: 2020-11-18 23:26:14
【Question description】:

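Pandas has a rolling() function that performs calculations over windows of Series and DataFrame objects. If the index is a datetime (or you point to a datetime column with the on parameter), rolling() can be applied over an offset, such as 2 seconds or 7 days.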
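
For readers unfamiliar with offset windows, here is a minimal sketch of the behaviour described above. The data is made up for illustration, and it relies on the default window bounds, under which each window covers the half-open interval (t - offset, t].

```python
import pandas as pd

# Illustrative data: five readings at irregular one- and two-second gaps.
times = pd.to_datetime(['2013-01-01 09:00:00',
                        '2013-01-01 09:00:02',
                        '2013-01-01 09:00:03',
                        '2013-01-01 09:00:05',
                        '2013-01-01 09:00:06'])
s = pd.Series([0, 1, 2, 3, 4], index=times)

# Each window holds the rows whose timestamp falls in (t - 2s, t],
# so the number of rows per window varies from row to row.
sums = s.rolling('2s').sum()
print(sums)
```

The window at 09:00:03 contains two rows while the one at 09:00:05 contains only one, which is why the time actually spanned by a window can be shorter than the offset.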

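I would like to calculate the actual duration of each window, rather than the offset. The best approach I could think of is to duplicate the timestamp column, set one copy as the index, and then use rolling() to get the minimum and maximum. However, the new timestamp column is dropped after rolling() is called.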

import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                  'Tm': [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:00:02'),
                           pd.Timestamp('20130101 09:00:03'),
                           pd.Timestamp('20130101 09:00:05'),
                           pd.Timestamp('20130101 09:00:06'),
                           pd.Timestamp('20130101 09:00:10'),
                           pd.Timestamp('20130101 09:00:12'),
                           pd.Timestamp('20130101 09:00:16'),
                           pd.Timestamp('20130101 09:00:19'),
                           pd.Timestamp('20130101 09:00:20')]})

df['t'] = df['Tm']
print(df)
max_times = df.rolling('5s', on='Tm').max()
min_times = df.rolling('5s', on='Tm').min()
print(max_times)
print((max_times - min_times).astype('timedelta64[s]'))

Output:

   B                  Tm                   t
0  0 2013-01-01 09:00:00 2013-01-01 09:00:00
1  1 2013-01-01 09:00:02 2013-01-01 09:00:02
2  2 2013-01-01 09:00:03 2013-01-01 09:00:03
3  3 2013-01-01 09:00:05 2013-01-01 09:00:05
4  4 2013-01-01 09:00:06 2013-01-01 09:00:06
5  5 2013-01-01 09:00:10 2013-01-01 09:00:10
6  6 2013-01-01 09:00:12 2013-01-01 09:00:12
7  7 2013-01-01 09:00:16 2013-01-01 09:00:16
8  8 2013-01-01 09:00:19 2013-01-01 09:00:19
9  9 2013-01-01 09:00:20 2013-01-01 09:00:20
     B                  Tm
0  0.0 2013-01-01 09:00:00
1  1.0 2013-01-01 09:00:02
2  2.0 2013-01-01 09:00:03
3  3.0 2013-01-01 09:00:05
4  4.0 2013-01-01 09:00:06
5  5.0 2013-01-01 09:00:10
6  6.0 2013-01-01 09:00:12
7  7.0 2013-01-01 09:00:16
8  8.0 2013-01-01 09:00:19
9  9.0 2013-01-01 09:00:20
         B   Tm
0 00:00:00  0.0
1 00:00:01  0.0
2 00:00:02  0.0
3 00:00:02  0.0
4 00:00:03  0.0
5 00:00:01  0.0
6 00:00:01  0.0
7 00:00:01  0.0
8 00:00:01  0.0
9 00:00:02  0.0

Surely there is a more elegant (and functional) technique?

【Discussion】:

Tags: python pandas

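【Solution 1】:

I achieved this in the following way:

Set the timestamp column as the index;
Define a function that takes the window slice handed over by rolling() (by default a Series whose index holds the window's timestamps), converts the index to integers, and returns the difference between the maximum and minimum of the index values;
Call rolling() on the DataFrame and use its apply() method, which lets you specify the function to run on each window.

The documentation for the apply() function is here: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.apply.html

Example: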

import pandas as pd

df = pd.DataFrame({'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                  'Tm': [pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06'),
                         pd.Timestamp('20130101 09:00:10'),
                         pd.Timestamp('20130101 09:00:12'),
                         pd.Timestamp('20130101 09:00:16'),
                         pd.Timestamp('20130101 09:00:19'),
                         pd.Timestamp('20130101 09:00:20')]})

def duration(X):
    ind = pd.to_numeric(X.index) * 10**-9 # Convert from nanoseconds to seconds. 
    return ind.max() - ind.min()

df = df.set_index("Tm")
print(df)
durations = df.rolling("5s").apply(duration) 
df = df.reset_index()  # assign the result; reset_index() does not modify df in place
print(durations)

Output:

                     B
Tm                    
2013-01-01 09:00:00  0
2013-01-01 09:00:02  0
2013-01-01 09:00:03  0
2013-01-01 09:00:05  0
2013-01-01 09:00:06  0
2013-01-01 09:00:10  0
2013-01-01 09:00:12  0
2013-01-01 09:00:16  0
2013-01-01 09:00:19  0
2013-01-01 09:00:20  0
                       B
Tm                      
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  2.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  4.0
2013-01-01 09:00:10  4.0
2013-01-01 09:00:12  2.0
2013-01-01 09:00:16  4.0
2013-01-01 09:00:19  3.0
2013-01-01 09:00:20  4.0

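【Discussion】:

While your solution works well for the given example, keep in mind that apply() becomes very slow on larger DataFrames, because the operation is not vectorized. Instead, simply add the datetime to the DataFrame as an integer column, then compute the duration by subtracting df.rolling('5s').min() from df.rolling('5s').max().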
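
The suggestion in this comment could be sketched roughly as follows. This is a minimal sketch rather than the commenter's own code: the column names `secs` and `duration` are hypothetical, and the timestamps are stored as float seconds elapsed since the first row instead of raw integers, an assumption made here to keep the numbers small.

```python
import pandas as pd

# Same timestamps as in the question.
times = pd.to_datetime(['2013-01-01 09:00:00', '2013-01-01 09:00:02',
                        '2013-01-01 09:00:03', '2013-01-01 09:00:05',
                        '2013-01-01 09:00:06', '2013-01-01 09:00:10',
                        '2013-01-01 09:00:12', '2013-01-01 09:00:16',
                        '2013-01-01 09:00:19', '2013-01-01 09:00:20'])
df = pd.DataFrame({'B': range(10)}, index=times)

# Numeric copy of the timestamps (hypothetical column name 'secs'):
# seconds elapsed since the first row, so rolling() can aggregate it.
df['secs'] = (df.index - df.index[0]).total_seconds().to_numpy()

# Built-in rolling max() and min() avoid calling a Python function per window.
window = df['secs'].rolling('5s')
df['duration'] = window.max() - window.min()
print(df[['B', 'duration']])
```

For this data the `duration` column matches the values produced by the apply()-based answer. The built-in max() and min() aggregations run in compiled code, which is the reason the comment expects them to scale better than apply().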