Convert an irregular time series to hourly data in Python and have a normal distribution

【Question title】: convert irregular time series to hourly data in python and have normal distribution
【Posted】: 2018-12-25 18:47:49
【Question description】:

I have a data frame that looks like this:

Date          Time        Entry      Exist
2013-01-07    05:00:00       29.0       12.0
2013-01-07    10:00:00       98.0       83.0
2013-01-07    15:00:00      404.0      131.0
2013-01-07    20:00:00     2340.0      229.0
2013-01-08    05:00:00     3443.0      629.0
2013-01-08    10:00:00     6713.0     1629.0
2013-01-08    15:00:00     9547.0     2965.0
2013-01-08    20:00:00    10440.0     4589.0

I want to convert and normalize it so that it shows the hourly consumption over a period of time.

DateTime               Entry    Exist
2013-01-07 00:00:00      2.0      1.0
2013-01-07 01:00:00      9.0      4.0
2013-01-07 02:00:00     16.0      6.0
2013-01-07 03:00:00     23.0      9.0
2013-01-07 04:00:00     26.0     10.0
2013-01-07 05:00:00     29.0     12.0
2013-01-07 06:00:00     37.0     19.0
2013-01-07 07:00:00     56.0     32.0
2013-01-07 08:00:00     62.0     57.0
2013-01-07 09:00:00     77.0     63.0
2013-01-07 10:00:00     98.0     83.0
2013-01-07 11:00:00    104.0     95.0
.......

I want to first join Date and Time into a single DateTime column, and then arrive at the result above.

I am new to Python, so any help would be much appreciated. Thanks.

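【Question comments】:

From your example, there is no information available to fill in the missing times. For example, how would you know Entry and Exist at 2013-01-07 00:00:00?
I don't understand what is happening with your Entry and Exist columns.
I guess my initial request was unclear: the sample output I provided contains random consecutive numbers. As you can see, hours 05 and 10 have the same Entry and Exist values in both the input and the output. When we split DateTime into finer intervals, the new values will be NaN, so I need to interpolate them by some method. The Entry values for hours 1 to 4 should lie between 0 and 29, and the Exist values between 0 and 12. I need help with interpolation to predict the Entry and Exist values for hours 1 to 4, 6 to 9, 11 to 14, and so on.

Tags: python python-3.x pandas datetime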

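【Solution 1】:

The quick answer is that you can use

DataFrame.resample().mean().interpolate()

to do at least the interpolation part of your post.

Note that your post includes "out-of-domain" extrapolation, because you are predicting outside the domain of the input data. That is, the time series starts on January 7 at 5:00 AM, but your oversampled data starts 5 hours earlier. Interpolation is an in-domain method only, but I suspect it is what you want.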
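To make the in-domain point concrete, here is a minimal sketch using the first three rows from the question. The names `obs`, `full_index`, `extended` and `anchored` are made up for illustration, and the idea that the counters read 0 at midnight is purely an assumption, not something the data states.

```python
import pandas as pd

# First three observations from the question.
obs = pd.DataFrame(
    {"Entry": [29.0, 98.0, 404.0], "Exist": [12.0, 83.0, 131.0]},
    index=pd.to_datetime(
        ["2013-01-07 05:00:00", "2013-01-07 10:00:00", "2013-01-07 15:00:00"]
    ),
)

# "60min" is hourly; it is spelled this way because the hourly alias
# changed from "H" to "h" across pandas versions.
hourly = obs.resample("60min").mean()

# Extend the index back to midnight. The added rows start out as NaN.
full_index = pd.date_range("2013-01-07 00:00:00", hourly.index[-1], freq="60min")

# Plain interpolation leaves 00:00 to 04:00 as NaN, because there is no
# earlier observation to interpolate from.
extended = hourly.reindex(full_index).interpolate("linear")

# ASSUMPTION (not from the data): both counters read 0 at midnight.
# With that anchor row in place, the early hours become in-domain.
anchored = hourly.reindex(full_index)
anchored.iloc[0] = 0.0
anchored = anchored.interpolate("linear")

print(extended.head(7))
print(anchored.head(7))
```

Whether an anchor value like this is reasonable depends entirely on what the counters measure; without one, the hours before the first sample cannot be filled by interpolation alone.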

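Here are the steps for the interpolation.

First, it would help if you could post a self-contained example with code that generates the data for testing, or that reproduces it in some way.

See these two excellent posts:

Combine Date and Time columns using python pandas

How to create a Pandas DataFrame from a string

This is how I did it: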

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

# copied and pasted from your post :)
data = StringIO("""
Date             Time         Entry       Exist
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

# read in the data,  converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data has separate date and time columns

df = pd.read_csv(data, 
    parse_dates={"date_time": ['Date', 'Time']}, 
    delim_whitespace=True)
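The code comment above mentions combining the columns "after the fact". A minimal sketch of that route, assuming the data has already been read with separate text columns named `Date` and `Time` (the variable names `raw` and `df_late` are made up for illustration):

```python
import pandas as pd
from io import StringIO

raw = StringIO("""Date Time Entry Exist
2013-01-07 05:00:00 29.0 12.0
2013-01-07 10:00:00 98.0 83.0
""")

# sep=r"\s+" splits on runs of whitespace; Date and Time stay as text.
df_late = pd.read_csv(raw, sep=r"\s+")

# Join the two text columns and parse the result as a single timestamp.
df_late["date_time"] = pd.to_datetime(
    df_late["Date"] + " " + df_late["Time"], format="%Y-%m-%d %H:%M:%S"
)

# Drop the now-redundant columns and use the timestamp as the index.
df_late = df_late.drop(columns=["Date", "Time"]).set_index("date_time")
print(df_late)
```

This route may also be worth considering on recent pandas releases, where the dictionary form of `parse_dates` and the `delim_whitespace` argument are deprecated.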

Now make the data a time series, resample it, apply a function (the mean in this case), and interpolate both data columns at the same time.

df_rs = df.set_index('date_time').resample('H').mean().interpolate('linear')
df_rs

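It looks like this:

These values do not look exactly like the ones in your post, but it is not clear which interpolation was used there. Linear, cubic? Something else?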
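To see how far apart the two are, the sketch below compares linear interpolation between the 05:00 and 10:00 samples with the Entry values the question listed for the same hours. The names `obs_entry`, `linear`, `wanted` and `comparison` are made up for illustration.

```python
import pandas as pd

# The two Entry observations that bracket 05:00 to 10:00.
obs_entry = pd.Series(
    [29.0, 98.0],
    index=pd.to_datetime(["2013-01-07 05:00:00", "2013-01-07 10:00:00"]),
    name="Entry",
)

# "60min" is hourly, written in a form that works across pandas versions.
linear = obs_entry.resample("60min").mean().interpolate("linear")

# Entry values from the question's desired output for 05:00 through 10:00.
wanted = pd.Series(
    [29.0, 37.0, 56.0, 62.0, 77.0, 98.0], index=linear.index, name="wanted"
)

comparison = pd.concat([linear, wanted], axis=1)
comparison["difference"] = comparison["wanted"] - comparison["Entry"]
print(comparison)
```

Linear interpolation adds a constant (98 - 29) / 5 = 13.8 per hour, so it agrees with the question's table only at the two sampled hours.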

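For fun, let's plot the data with Bokeh. The big red dots are the original data, while the blue dots (and the connecting line) are the interpolated data.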

output_notebook()

p = figure(x_axis_type="datetime", width=800, height=500)

p.title.text = "Entry vs. Date Time (linear interpolated to 1H)"
p.xaxis.axis_label = 'Date Time (linear interpolated to 1H)'
p.yaxis.axis_label = 'Entry'

# orig data
p.circle(df['date_time'], df['Entry'], color='red', size=10)

# oversampled data
p.circle(df_rs.index, df_rs['Entry'])
p.line(df_rs.index, df_rs['Entry'])

show(p)

It looks like this:

Or, with cubic interpolation, you get more smoothing:
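A minimal sketch of the cubic variant, assuming SciPy is installed (pandas delegates the `"cubic"` method to it). The names `obs_entry_all`, `hourly_entry`, `linear_entry` and `cubic_entry` are made up for illustration.

```python
import pandas as pd

# All eight Entry observations from the question.
obs_entry_all = pd.DataFrame(
    {"Entry": [29.0, 98.0, 404.0, 2340.0, 3443.0, 6713.0, 9547.0, 10440.0]},
    index=pd.to_datetime(
        [
            "2013-01-07 05:00:00", "2013-01-07 10:00:00",
            "2013-01-07 15:00:00", "2013-01-07 20:00:00",
            "2013-01-08 05:00:00", "2013-01-08 10:00:00",
            "2013-01-08 15:00:00", "2013-01-08 20:00:00",
        ]
    ),
)

# "60min" is hourly, written in a form that works across pandas versions.
hourly_entry = obs_entry_all.resample("60min").mean()

linear_entry = hourly_entry.interpolate("linear")
cubic_entry = hourly_entry.interpolate("cubic")  # needs SciPy

# Both methods pass through the original samples; they differ in between.
print(pd.concat([linear_entry["Entry"], cubic_entry["Entry"]],
                axis=1, keys=["linear", "cubic"]).head(12))
```

A cubic curve is smoother, but it can overshoot between samples, so the interpolated values are not guaranteed to stay between the two neighbouring observations.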

Full code

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

output_notebook()

# copied and pasted from your post :)
data = StringIO("""
Date            Time        ENTRIES       EXITS
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

# read in the data,  converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data has separate date and time columns
original_data = pd.read_csv(data, 
    parse_dates={"DATETIME": ['Date', 'Time']}, 
    delim_whitespace=True)

# make it a time series, resample to a higher freq, apply mean, interpolate and round
inter_data = original_data.set_index(['DATETIME']).resample('H').mean().interpolate('linear').round(1) 

# No need to drop the index to select a slice.  You can slice on the index
# I see you are starting at 1/1 (jan 1st),  yet your data starts at 1/7 (Jan 7th?)
inter_data[inter_data.index >= '2013-01-01 00:00:00'].head(20)

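【Comments】:

The code I used is:
inter_data = original_data.set_index(['DATETIME']).resample('H').mean().interpolate('linear')
inter_data.reset_index(inplace=True)
inter_data.ENTRIES=inter_data.ENTRIES.round(1)
inter_data.EXITS=inter_data.EXITS.round(1)
inter_data[inter_data.DATETIME >= '2013-01-01 00:00:00'].head(20)
I tried to do the same on my data frame, but the Entry and Exist values are not continuous. This is what I got:
      index  DATETIME                ENTRIES      EXITS
72    72     2013-01-01 00:00:00   4234935.6  2175034.2
73    73     2013-01-01 01:00:00   3249043.7  2697696.5
74    74     2013-01-01 02:00:00   1404653.5   828730.7
75    75     2013-01-01 03:00:00   3959076.0  2685191.8
76    76     2013-01-01 04:00:00   4397057.8  2409161.3
77    77     2013-01-01 05:00:00   2683292.3  2182695.4
78    78     2013-01-01 06:00:00   1168712.3   687371.6
79    79     2013-01-01 07:00:00   3969078.2  2700993.6
The Entry and Exist values are not continuous.
I am not sure what you mean by not continuous. The resulting time series is a regular time series in 1-hour increments. The values at the non-sampled times are linear interpolations between the sampled times. Also note that you can call .round(1) once after .interpolate(), rather than calling it separately on each column. If this helped, please upvote my answer :).
The time series is fine. I am talking about the values at the non-sampled times. They are not in the expected range. Example: if the value at 05:00:00 is 100 and the value at 10:00:00 is 500, the values at 02, 03, 04 should be between 100 and 500, and the value at 02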