【Question Title】: Missing data range Pandas dataframe comparison Python
【Posted】: 2021-12-14 22:49:00
【Question Description】:

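How can I write code that outputs the difference between dates and data? data is missing data points wherever the 1-minute dates range has entries that are skipped. For example, after 2015-10-08 13:53:00 there are 6 missing data points, so the missing range should be printed as '2015-10-08 13:54:00', '2015-10-08 14:00:00'. The ranges should be recorded in a 2-D array, as shown in Expected Output. How would I write a function that produces the expected output?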

import pandas as pd 
import datetime 

dates = pd.date_range("2015-10-08 13:40:00", "2015-10-08 14:12:00", freq="1min")
data = pd.to_datetime(['2015-10-08 13:41:00',
               '2015-10-08 13:42:00', '2015-10-08 13:43:00',
               '2015-10-08 13:44:00', '2015-10-08 13:45:00',
               '2015-10-08 13:46:00', '2015-10-08 13:47:00',
               '2015-10-08 13:48:00', '2015-10-08 13:49:00',
               '2015-10-08 13:50:00', '2015-10-08 13:51:00',
               '2015-10-08 13:52:00', '2015-10-08 13:53:00',
               '2015-10-08 13:54:00', '2015-10-08 14:01:00',
               '2015-10-08 14:02:00', '2015-10-08 14:03:00',
               '2015-10-08 14:04:00', '2015-10-08 14:05:00',
               '2015-10-08 14:06:00', '2015-10-08 14:07:00',
               '2015-10-08 14:10:00', '2015-10-08 14:11:00',
               '2015-10-08 14:12:00'])

Expected Output:

[['2015-10-08 13:40:00'], 
 ['2015-10-08 13:54:00', '2015-10-08 14:00:00'],
 ['2015-10-08 14:08:00', '2015-10-08 14:09:00']]

【Question Discussion】:

Tags: python pandas numpy datetime time


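【Solution 1】:

Both dates and data are DatetimeIndex objects. You can use pd.Index.difference to get the timestamps in dates that are absent from data, then group consecutive missing timestamps into ranges: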

In [55]: s = pd.Series(dates.difference(data))
    ...: s # sort if needed
Out[55]:
0   2015-10-08 13:40:00
1   2015-10-08 13:55:00
2   2015-10-08 13:56:00
3   2015-10-08 13:57:00
4   2015-10-08 13:58:00
5   2015-10-08 13:59:00
6   2015-10-08 14:00:00
7   2015-10-08 14:08:00
8   2015-10-08 14:09:00
dtype: datetime64[ns]

In [56]: groups_diff_ne_1min = s.diff().fillna(pd.Timedelta(seconds=60)) != pd.Timedelta(seconds=60)
    ...: groups_diff_ne_1min
Out[56]:
0    False
1     True
2    False
3    False
4    False
5    False
6    False
7     True
8    False
dtype: bool

In [57]: groups = groups_diff_ne_1min.cumsum()
    ...: groups
Out[57]:
0    0
1    1
2    1
3    1
4    1
5    1
6    1
7    2
8    2
dtype: int64

In [58]: s.groupby(groups).agg(['first', 'last'])
Out[58]:
                first                last
0 2015-10-08 13:40:00 2015-10-08 13:40:00
1 2015-10-08 13:55:00 2015-10-08 14:00:00
2 2015-10-08 14:08:00 2015-10-08 14:09:00

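【Discussion】:

  • Thanks, that works. Is there also a way to show how many data points are missing in each range, for example [1, 6, 2]?
  • Sure, add 'size' to the aggregation list.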
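
As a minimal sketch of that suggestion, the snippet below repeats the grouping from the answer and adds 'size' to the aggregation list. The variable names one_min, summary and counts are illustrative, and data is again rebuilt from contiguous ranges rather than the literal list in the question.

```python
import pandas as pd

dates = pd.date_range("2015-10-08 13:40:00", "2015-10-08 14:12:00", freq="1min")
# Same timestamps as the question's `data`, built from contiguous ranges.
data = pd.date_range("2015-10-08 13:41:00", "2015-10-08 13:54:00", freq="1min").append([
    pd.date_range("2015-10-08 14:01:00", "2015-10-08 14:07:00", freq="1min"),
    pd.date_range("2015-10-08 14:10:00", "2015-10-08 14:12:00", freq="1min"),
])

s = pd.Series(dates.difference(data))
one_min = pd.Timedelta(minutes=1)
groups = (s.diff().fillna(one_min) != one_min).cumsum()

# 'size' counts the missing timestamps in each consecutive run.
summary = s.groupby(groups).agg(["first", "last", "size"])
counts = summary["size"].tolist()
print(counts)  # [1, 6, 2]
```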