Python time series - count periods below/above a threshold for a specified minimum duration

【Question title】: Python time series - count periods below/above a threshold for a specified minimum duration
【Posted】: 2021-08-15 11:52:50
【Question】:

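In a pandas time series, I am trying to compute a metric that combines a value threshold with a duration.

For example, we want the number of periods longer than 5 minutes in which the ['pct'] column is below 80.

The dataframe looks like this: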

timestamp	pct
27-05-2021 10:11	95
27-05-2021 10:12	94
27-05-2021 10:13	80
27-05-2021 10:14	94
27-05-2021 10:15	80
27-05-2021 10:16	80
27-05-2021 10:17	80
27-05-2021 10:18	80
27-05-2021 10:19	80
27-05-2021 10:20	91
27-05-2021 10:21	NaN
27-05-2021 10:22	80
27-05-2021 10:23	80
27-05-2021 10:24	80
27-05-2021 10:25	80
27-05-2021 10:26	94

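So 1 period should be identified (we do not want to count periods that contain NaN values).

Ben B's post and Alain T's answer here helped somewhat: How to count consecutive periods in a timeseries above/below threshold?

I have attached an ugly picture made in Microsoft Paint to illustrate the problem.

Note: this is a fairly large dataframe, so I am not sure iterating over it is the best idea, but any help is much appreciated.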

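【Comments】:

So in the end, do you want the number 1 as a count, a filtered dataframe, or a list of the dataframe rows that satisfy the condition?
In the end I only want the count of how many periods satisfy the condition. But if I were left with a filtered dataframe, I could work from there too...
I think a simple solution, which I almost have working, is plain conditional filtering so that you get booleans, and then df.cumsum() to count the minutes. But I don't know how to get the count, or how to reset the count when it hits "False".
Yes, a similar idea here; I posted an answer, hope it helps.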
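The "cumsum that resets on False" idea from the comments can be sketched roughly as below. This is only an illustration, not the posted answer: it assumes evenly spaced one-minute samples, so it counts rows rather than comparing timestamps, and the names below, block, run_length and min_rows are made up for the example.

```python
import pandas as pd

# pct values from the question; None becomes NaN
pct = pd.Series(
    [95, 94, 80, 94, 80, 80, 80, 80, 80, 91, None, 80, 80, 80, 80, 94],
    dtype="float64",
)

below = pct.le(80)                 # NaN compares as False
block = (~below).cumsum()          # every False row starts a new block
# cumulative sum inside each block: a running count that resets at each False
run_length = below.astype(int).groupby(block).cumsum()

min_rows = 5                       # hypothetical minimum run length, in rows
# a run of at least min_rows rows passes through the value min_rows exactly once
n_periods = int((run_length == min_rows).sum())
print(run_length.tolist())
print(n_periods)
```

With one-minute sampling, a run of 5 rows spans 4 minutes from first to last timestamp, so min_rows has to be chosen with that off-by-one in mind.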

Tags: python pandas time-series data-science

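【Solution 1】:

You can group the consecutive runs of values at or below the threshold (80 here) in the dataframe, then use a list comprehension to check the condition on each group and take the length of the result: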

import pandas as pd

# assumes `df` holds the data shown in the question;
# parse the day-first timestamps so that they can be subtracted
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d-%m-%Y %H:%M")

# first is the `pct` column's threshold, the other is the minute threshold for `timestamp`
value_thre = 80
minute_thre = 3

# group by consecutive runs of values at or below `value_thre`
grouper = df.groupby(df.pct.le(value_thre).diff().ne(0).cumsum())

# a group qualifies when no `pct` value exceeds the value threshold
# and the time between its first and last timestamp is long enough
condition = lambda gr: (gr.pct.max() <= value_thre
                        and gr.timestamp.iloc[-1] - gr.timestamp.iloc[0] > pd.Timedelta(f"{minute_thre} min"))

# filter the groups and count the ones that qualify
result = len([g for _, g in grouper if condition(g)])

which gives

>>> result
1
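Since the question mentions a large dataframe, here is a sketch of the same grouping idea without a Python-level loop over the groups. It is a variation under assumptions, not the answer's code: the sample frame is rebuilt with pd.date_range, runs are labelled with shift() instead of diff(), and the names below, run_id, spans and durations are illustrative.

```python
import pandas as pd

# rebuild the sample data from the question (one row per minute)
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-05-27 10:11", periods=16, freq="min"),
    "pct": [95, 94, 80, 94, 80, 80, 80, 80, 80, 91, None, 80, 80, 80, 80, 94],
})

value_thre = 80
minute_thre = 3

below = df["pct"].le(value_thre)                # NaN compares as False
run_id = below.ne(below.shift()).cumsum()       # new id whenever the flag flips

# first and last timestamp of every below-threshold run
spans = (df.loc[below, "timestamp"]
           .groupby(run_id[below])
           .agg(["first", "last"]))
durations = spans["last"] - spans["first"]

result = int((durations > pd.Timedelta(minutes=minute_thre)).sum())
print(result)
```

On the sample data the three below-threshold runs last 0, 4 and 3 minutes, so only one of them is strictly longer than 3 minutes.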

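【Discussion】:

Awesome. I'm testing it now. Thanks a lot!
@JesperMølgaard In that case, I'd say you can replace gr.timestamp.iloc in condition with gr.index to reach the index directly and compare there (from what I understand of your comment).
If that fails, you can try gr.timestamp[-1] and gr.timestamp[0] respectively, i.e. without iloc (from what I understand of the traceback).
@JesperMølgaard Glad to hear it! Does the rest work?
@JesperMølgaard For the second issue, I think changing df.pct.eq to df.pct.le should do it (from "equal" to "less than or equal"). As for NaN, as far as I know, running df = df.dropna() before all of this should work. Could you clarify how NaNs should be treated?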
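To make the gr.index suggestion and the NaN question concrete, here is a small sketch that assumes the timestamps are the dataframe's DatetimeIndex. The helper count_periods and the six-row sample are invented for illustration. It shows one thing to watch for: because le() treats NaN as False, a NaN ends a run, whereas calling dropna() first removes the row, so the runs on either side of it are grouped together.

```python
import pandas as pd

def count_periods(df, value_thre, min_duration):
    """Count runs of pct <= value_thre whose first-to-last span exceeds min_duration.

    Assumes df has a DatetimeIndex and a `pct` column (hypothetical helper).
    """
    below = df["pct"].le(value_thre)             # NaN -> False
    run_id = below.ne(below.shift()).cumsum()    # label consecutive runs
    return sum(
        1
        for _, gr in df.groupby(run_id)
        if gr["pct"].max() <= value_thre
        and gr.index[-1] - gr.index[0] > min_duration
    )

# made-up sample: a NaN sits in the middle of a stretch of 80s
idx = pd.date_range("2021-05-27 10:11", periods=6, freq="min", name="timestamp")
df = pd.DataFrame({"pct": [95, 80, 80, None, 80, 80]}, index=idx)

limit = pd.Timedelta(minutes=2)
nan_as_break = count_periods(df, 80, limit)          # two 1-minute runs
after_dropna = count_periods(df.dropna(), 80, limit) # one run, 10:12 to 10:16
print(nan_as_break, after_dropna)
```

Which behaviour is wanted depends on how NaNs should be treated, which is exactly what the last comment asks.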