[Question title]: Pandas rolling max for time series data
[Posted]: 2021-04-27 02:15:54
[Question]:

When I apply rolling("1D").max() to two datasets in a Jupyter notebook, I get two different behaviors.

I need to compute the rolling maximum within each day.

Sample:
import pandas as pd

df = pd.DataFrame({'B': [0, 4, 3, 3, 4, 2, 1, 2, 3, 4]},
                  index = [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:02:02'),
                           pd.Timestamp('20130101 09:03:03'),
                           pd.Timestamp('20130101 09:04:05'),
                           pd.Timestamp('20130101 09:15:06'),                          
                           pd.Timestamp('20130102 09:16:06'),
                           pd.Timestamp('20130102 09:17:06'),
                           pd.Timestamp('20130102 09:35:06'),
                           pd.Timestamp('20130102 09:36:06'),
                           pd.Timestamp('20130102 09:37:06')])

df.rolling("1D").max() #gives desired output

                        B
2013-01-01 09:00:00     0.0
2013-01-01 09:02:02     4.0
2013-01-01 09:03:03     4.0
2013-01-01 09:04:05     4.0
2013-01-01 09:15:06     4.0
2013-01-02 09:16:06     2.0 # <- 2 is the highest value for new day
2013-01-02 09:17:06     2.0
2013-01-02 09:35:06     2.0
2013-01-02 09:36:06     3.0
2013-01-02 09:37:06     4.0

When I apply the same call to my real data, I get:

# Sample data
data = '{"High":{"1611221400000":0.99615,"1611222300000":0.9751,"1611223200000":1.035,"1611224100000":0.9894,"1611225000000":1.385,"1611225900000":1.345,"1611226800000":1.235,"1611227700000":1.245,"1611228600000":1.315,"1611229500000":1.295,"1611230400000":1.28,"1611231300000":1.295,"1611232200000":1.415,"1611233100000":1.415,"1611234000000":1.355,"1611234900000":1.385,"1611235800000":1.335,"1611236700000":1.325,"1611237600000":1.365,"1611238500000":1.445,"1611239400000":1.515,"1611240300000":1.475,"1611241200000":1.405,"1611242100000":1.375,"1611243000000":1.255,"1611243900000":1.225,"1611307800000":1.375,"1611308700000":1.415,"1611309600000":1.495}}'
df2 = pd.read_json(data)

df2.rolling("1D").max()
# keeps rolling from previous day

    High
Date    
2021-01-21 09:30:00     0.99615
2021-01-21 09:45:00     0.99615
2021-01-21 10:00:00     1.03500
2021-01-21 10:15:00     1.03500
2021-01-21 10:30:00     1.38500
2021-01-21 10:45:00     1.38500
2021-01-21 11:00:00     1.38500
2021-01-21 11:15:00     1.38500
2021-01-21 11:30:00     1.38500
2021-01-21 11:45:00     1.38500
2021-01-21 12:00:00     1.38500
2021-01-21 12:15:00     1.38500
2021-01-21 12:30:00     1.41500
2021-01-21 12:45:00     1.41500
2021-01-21 13:00:00     1.41500
2021-01-21 13:15:00     1.41500
2021-01-21 13:30:00     1.41500
2021-01-21 13:45:00     1.41500
2021-01-21 14:00:00     1.41500
2021-01-21 14:15:00     1.44500
2021-01-21 14:30:00     1.51500
2021-01-21 14:45:00     1.51500
2021-01-21 15:00:00     1.51500
2021-01-21 15:15:00     1.51500
2021-01-21 15:30:00     1.51500
2021-01-21 15:45:00     1.51500
2021-01-22 09:30:00     1.51500 # <- value got rolled from previous day
2021-01-22 09:45:00     1.51500
2021-01-22 10:00:00     1.51500

pandas version = 0.25.1

Both DataFrames have a DatetimeIndex with dtype='datetime64[ns]', freq=None.

Any idea why this happens?

[Comments]:

    Tags: python pandas


    [Answer 1]:

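    In both cases the rolling window applies a one-day filter, which means 24 hours, not a calendar day.

    I modified your first example slightly; look at the output: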

    df = pd.DataFrame({'B': [0, 4, 3, 3, 4, 2, 1, 2, 3, 4]},
                      index = [pd.Timestamp('20130101 09:00:00'),
                               pd.Timestamp('20130101 09:02:02'),
                               pd.Timestamp('20130101 09:03:03'),
                               pd.Timestamp('20130101 09:04:05'),
                               pd.Timestamp('20130101 09:15:06'),                          
                               pd.Timestamp('20130102 09:13:06'), # <-- minus 3 minutes
                               pd.Timestamp('20130102 09:17:06'),
                               pd.Timestamp('20130102 09:35:06'),
                               pd.Timestamp('20130102 09:36:06'),
                               pd.Timestamp('20130102 09:37:06')])
    df.rolling("1D").max()
    >>> 
                           B
    2013-01-01 09:00:00  0.0
    2013-01-01 09:02:02  4.0
    2013-01-01 09:03:03  4.0
    2013-01-01 09:04:05  4.0
    2013-01-01 09:15:06  4.0
    2013-01-02 09:13:06  4.0 # <-- overlap of days
    2013-01-02 09:17:06  2.0
    2013-01-02 09:35:06  2.0
    2013-01-02 09:36:06  3.0
    2013-01-02 09:37:06  4.0
    

    This means rolling is doing the same thing in both cases.
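To make the 24-hour window concrete, here is a minimal sketch on three rows taken from the modified example. It assumes the default behavior for offset windows, where each row at timestamp t sees the half-open interval (t - 1 day, t]; the manual loop is only an illustration, not how pandas computes the result internally.

```python
import pandas as pd

idx = pd.to_datetime([
    "2013-01-01 09:15:06",  # last row of the first day
    "2013-01-02 09:13:06",  # less than 24 hours after the row above
    "2013-01-02 09:17:06",  # more than 24 hours after the first row
])
s = pd.Series([4, 2, 1], index=idx, name="B")

# Rebuild each window by hand: rows with timestamp in (t - 1 day, t].
window = pd.Timedelta("1D")
for t in s.index:
    in_window = s[(s.index > t - window) & (s.index <= t)]
    print(t, "rows in window:", len(in_window), "max:", in_window.max())

# The built-in offset window gives the same maxima.
rolled = s.rolling("1D").max()
print(rolled)
```

The second row still sees the 4 from the previous day because only 23 hours and 58 minutes have passed; the third row no longer does.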

    If you want the rolling maximum within each day, you may want to do something like this:

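    # Group by calendar date; index.day is only the day of the month and would merge e.g. Jan 21 with Feb 21.
    df = df.groupby(df.index.date).rolling('1D').max()
    

    df2 = df2.groupby(df2.index.date).rolling('1D').max()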
    

    This returns a DataFrame with a MultiIndex.

    The MultiIndex can be reduced in a following step using

    df.index = df.index.droplevel(0) 
    

    df2.index = df2.index.droplevel(0) 
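
As a related sketch (a swapped-in technique, not the code from this answer): because each calendar-day group spans less than 24 hours, the one-day window never drops a row inside a group, so a per-group running maximum with cummax should give the same values while keeping the original index, with no MultiIndex to drop afterwards. The variable name daily_max is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"B": [0, 4, 3, 3, 4, 2, 1, 2, 3, 4]},
    index=pd.to_datetime([
        "2013-01-01 09:00:00", "2013-01-01 09:02:02", "2013-01-01 09:03:03",
        "2013-01-01 09:04:05", "2013-01-01 09:15:06", "2013-01-02 09:13:06",
        "2013-01-02 09:17:06", "2013-01-02 09:35:06", "2013-01-02 09:36:06",
        "2013-01-02 09:37:06",
    ]),
)

# Group by the midnight-normalized timestamp (one group per calendar day)
# and take the running maximum inside each group.
daily_max = df.groupby(df.index.normalize())["B"].cummax()
print(daily_max)
```

The 09:13:06 row on the second day now starts again from its own value instead of inheriting the 4 from the previous day.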
    

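    [Comments]:

    • I see. That makes sense now.
    [Answer 2]:

    A one-day rolling window ('1D') does not run from midnight to midnight; it spans 24 hours regardless of where the date changes. You can see this with: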

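    def fun(x):
        # With raw=False, x is a Series, so its index gives the first and last timestamp of the window.
        print(x.index[0], x.index[-1])
        return len(x)
    df2.rolling("1D").apply(fun, raw=False)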
    

    So what you need is df2.set_index(df2.index.normalize()).rolling("1d").max():

    df2.High = df2.set_index(df2.index.normalize()).rolling("1d").max().to_numpy()
    

    Result:

                            High
    2021-01-21 09:30:00  0.99615
    2021-01-21 09:45:00  0.99615
    2021-01-21 10:00:00  1.03500
    2021-01-21 10:15:00  1.03500
    2021-01-21 10:30:00  1.38500
    2021-01-21 10:45:00  1.38500
    2021-01-21 11:00:00  1.38500
    2021-01-21 11:15:00  1.38500
    2021-01-21 11:30:00  1.38500
    2021-01-21 11:45:00  1.38500
    2021-01-21 12:00:00  1.38500
    2021-01-21 12:15:00  1.38500
    2021-01-21 12:30:00  1.41500
    2021-01-21 12:45:00  1.41500
    2021-01-21 13:00:00  1.41500
    2021-01-21 13:15:00  1.41500
    2021-01-21 13:30:00  1.41500
    2021-01-21 13:45:00  1.41500
    2021-01-21 14:00:00  1.41500
    2021-01-21 14:15:00  1.44500
    2021-01-21 14:30:00  1.51500
    2021-01-21 14:45:00  1.51500
    2021-01-21 15:00:00  1.51500
    2021-01-21 15:15:00  1.51500
    2021-01-21 15:30:00  1.51500
    2021-01-21 15:45:00  1.51500
    2021-01-22 09:30:00  1.37500
    2021-01-22 09:45:00  1.41500
    2021-01-22 10:00:00  1.49500
    

    This is about 2-3 times faster than a groupby on index.date followed by dropping the extra index level.
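
For a side-by-side view, here is a minimal sketch on six of the High values from the question (the last three bars of 2021-01-21 from 14:30 on, skipping 14:45 to 15:15, and the three bars of 2021-01-22). It works on a Series rather than the whole DataFrame, and the names high, plain and per_day are illustrative assumptions. The index is normalized only for the computation; the values are then placed back on the original timestamps.

```python
import pandas as pd

high = pd.Series(
    [1.515, 1.255, 1.225, 1.375, 1.415, 1.495],
    index=pd.to_datetime([
        "2021-01-21 14:30", "2021-01-21 15:30", "2021-01-21 15:45",
        "2021-01-22 09:30", "2021-01-22 09:45", "2021-01-22 10:00",
    ]),
    name="High",
)

# Plain 24-hour window: the 1.515 from the previous afternoon is still
# inside the window on the morning of the 22nd.
plain = high.rolling("1D").max()

# Same values on a midnight-normalized index: every row of a day shares one
# timestamp, so the window (t - 1 day, t] excludes the previous day.
normalized = pd.Series(high.to_numpy(), index=high.index.normalize())
by_day = normalized.rolling("1D").max()

# Put the result back on the original timestamps so the time is kept.
per_day = pd.Series(by_day.to_numpy(), index=high.index, name="High")
print(pd.DataFrame({"plain": plain, "per_day": per_day}))
```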

    Another possibility is to use a VariableOffsetWindowIndexer with a normalized DateOffset of 0 days, but this is very slow:

    indexer = pd.api.indexers.VariableOffsetWindowIndexer(index=df2.index, offset=pd.tseries.offsets.DateOffset(0, normalize=True))
    df2.rolling(indexer).max()
    

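    [Comments]:

    • Thanks for your post. I would also like to keep the time in my index.
    • That is why I only removed it temporarily. If you want to keep the time, use e.g. df2.High = df2.set_index(df2.index.normalize()).rolling("1d").max().to_numpy()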