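[Posted]: 2021-04-07 19:15:00
[Question]:
I want readings every 15 minutes, aligned to the hour, from a set of readings that are taken hourly but offset from the hour by a few minutes.
My first approach was to resample to 15 minutes, but I don't get the expected result.
If the readings fall exactly on the hour, resampling works fine: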
import pandas as pd
left_key = pd.to_datetime(['2020-12-01 00:00',
'2020-12-01 01:00',
'2020-12-01 02:00',
'2020-12-01 03:00',
'2020-12-01 04:00',
'2020-12-01 05:00'])
left_data = pd.Series([12,12,13,15,16,15], index=left_key, name='master')
resampled = left_data.resample('15min')
resampled.interpolate(method='spline', order=2)
This gives me what I need:
2020-12-01 00:00:00 12.000000
2020-12-01 00:15:00 11.777455
2020-12-01 00:30:00 12.079464
2020-12-01 00:45:00 12.370313
2020-12-01 01:00:00 12.000000
2020-12-01 01:15:00 12.918527
2020-12-01 01:30:00 13.175893
But if the readings are offset from the hour:
left_key = pd.to_datetime(['2020-12-01 00:06',
'2020-12-01 01:06',
'2020-12-01 02:06',
'2020-12-01 03:06',
'2020-12-01 04:06',
'2020-12-01 05:06'])
left_data = pd.Series([12,12,13,15,16,15], index=left_key, name='master')
resampled = left_data.resample('15min')
resampled.interpolate(method='spline', order=2)
Now I get no data:
2020-12-01 00:00:00 NaN
2020-12-01 00:15:00 NaN
2020-12-01 00:30:00 NaN
2020-12-01 00:45:00 NaN
2020-12-01 01:00:00 NaN
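A likely reason for the all-NaN result, offered as a sketch rather than a statement about pandas internals in every version: upsampling first places the data on the new 15-minute grid, and none of the readings taken at six minutes past the hour land on a grid point, so there is nothing left to interpolate between. The check below uses only `resample(...).asfreq()`; the name `on_grid` is illustrative.

```python
import pandas as pd

left_key = pd.to_datetime(['2020-12-01 00:06',
                           '2020-12-01 01:06',
                           '2020-12-01 02:06',
                           '2020-12-01 03:06',
                           '2020-12-01 04:06',
                           '2020-12-01 05:06'])
left_data = pd.Series([12, 12, 13, 15, 16, 15], index=left_key, name='master')

# Put the data on the 15-minute grid without aggregating or filling.
on_grid = left_data.resample('15min').asfreq()

# Check whether any original timestamp (hh:06) coincides with a grid
# timestamp (hh:00, hh:15, hh:30, hh:45).
print(on_grid.index.intersection(left_data.index).empty)

# Check whether every slot on the grid is empty.
print(on_grid.isna().all())
```

If both checks print `True`, the original readings were dropped before interpolation ran, which would explain the output above.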
If I resample hourly instead, it just shifts the readings back to the hour:
resampled = left_data.resample('H')
resampled.interpolate(method='spline', order=2)
2020-12-01 00:00:00 12
2020-12-01 01:00:00 12
2020-12-01 02:00:00 13
2020-12-01 03:00:00 15
2020-12-01 04:00:00 16
2020-12-01 05:00:00 15
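Is there a way to get resample to interpolate the readings so that I get the correct on-the-hour values? (Is there a better title for this question?)
Update
Although these solutions work, they are not suitable for large amounts of data. 1000 rows is too much for my machine! Even with a coarser initial resample, it takes a lot of memory and time to finish.
Here is another solution, taken from this question: Interpolate one time series onto custom time series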
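from datetime import datetime
# df is assumed to be a DataFrame with a DatetimeIndex holding the original readings
# create a new index for the ranges of datetimes required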
starts = df.index.min()
starts = datetime(starts.year, starts.month, starts.day, starts.hour,15*(starts.minute // 15))
master = pd.date_range(starts, df.index.max(), freq="15min")
# will need this to identify original data rows later
df['tag'] = True
# merge with original data and interpolate missing rows
idx = df.index.union(master)
df2 = df.reindex(idx).interpolate('index')
# now remove the things we don't want
df2.drop(df2.index[0], inplace=True) # first value will be NaN (unless has real data)
# use the tag column to remove the original data and then drop that column
df2 = df2[df2['tag'].isna()]
df2.drop(columns=['tag',], inplace=True)
This is much faster!
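For large inputs, one alternative sketch is to skip the index union and the tag column and interpolate with `numpy.interp` on elapsed seconds. Note that this is plain linear interpolation, not the order-2 spline used earlier, and that the helper name `interp_to_grid` and its `freq` parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def interp_to_grid(series, freq='15min'):
    """Linearly interpolate a time-indexed Series onto a regular grid.

    Hypothetical helper for illustration; assumes unique timestamps.
    """
    series = series.sort_index()
    # The grid runs from the first reading, floored to `freq`, up to the last reading.
    grid = pd.date_range(series.index.min().floor(freq), series.index.max(), freq=freq)
    t0 = series.index[0]
    # Elapsed seconds since the first reading, for the samples and for the grid.
    x = (series.index - t0).total_seconds().to_numpy()
    xq = (grid - t0).total_seconds().to_numpy()
    # Grid points outside the observed range become NaN instead of being extrapolated.
    values = np.interp(xq, x, series.to_numpy(dtype=float), left=np.nan, right=np.nan)
    return pd.Series(values, index=grid, name=series.name)


left_key = pd.to_datetime(['2020-12-01 00:06',
                           '2020-12-01 01:06',
                           '2020-12-01 02:06',
                           '2020-12-01 03:06',
                           '2020-12-01 04:06',
                           '2020-12-01 05:06'])
left_data = pd.Series([12, 12, 13, 15, 16, 15], index=left_key, name='master')

quarter_hourly = interp_to_grid(left_data)
print(quarter_hourly)
```

With the sample readings, the value at 02:00 lies nine tenths of the way between the 01:06 and 02:06 readings (12 and 13), and the 00:00 slot stays NaN because it precedes the first reading.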
[Discussion]:
Tags: pandas time-series