【Question title】: Pandas time series - join by closest time
【Posted】: 2016-08-24 08:13:25
【Question】:

I have two DataFrames, which can be represented by the following MWE:

import pandas as pd
from datetime import datetime
import numpy as np

df_1 = pd.DataFrame(np.random.randn(9), columns = ['A'], index= [
                                                datetime(2015,1,1,19,30,1,20),
                                                datetime(2015,1,1,20,30,2,12),
                                                datetime(2015,1,1,21,30,3,50),
                                                datetime(2015,1,1,22,30,5,43),
                                                datetime(2015,1,1,22,30,52,11),
                                                datetime(2015,1,1,23,30,54,8),
                                                datetime(2015,1,1,23,40,14,2),
                                                datetime(2015,1,1,23,41,13,33),
                                                datetime(2015,1,1,23,50,21,32),
                                                ])

df_2 = pd.DataFrame(np.random.randn(9), columns = ['B'], index= [
                                                datetime(2015,1,1,18,30,1,20),
                                                datetime(2015,1,1,21,0,2,12),
                                                datetime(2015,1,1,21,31,3,50),
                                                datetime(2015,1,1,22,34,5,43),
                                                datetime(2015,1,1,22,35,52,11),
                                                datetime(2015,1,1,23,0,54,8),
                                                datetime(2015,1,1,23,41,14,2),
                                                datetime(2015,1,1,23,42,13,33),
                                                datetime(2015,1,1,23,56,21,32),
                                                ])

I want to merge the two DataFrames into one, and I know I can do that with the following code:

In [21]: df_1.join(df_2, how='outer')
Out[21]: 
                                   A         B
2015-01-01 18:30:01.000020       NaN -1.411907
2015-01-01 19:30:01.000020  0.109913       NaN
2015-01-01 20:30:02.000012 -0.440529       NaN
2015-01-01 21:00:02.000012       NaN -1.277403
2015-01-01 21:30:03.000050 -0.194020       NaN
2015-01-01 21:31:03.000050       NaN -0.042259
2015-01-01 22:30:05.000043  1.445220       NaN
2015-01-01 22:30:52.000011 -0.341176       NaN
2015-01-01 22:34:05.000043       NaN  0.905912
2015-01-01 22:35:52.000011       NaN -0.167559
2015-01-01 23:00:54.000008       NaN  1.289961
2015-01-01 23:30:54.000008 -0.929973       NaN
2015-01-01 23:40:14.000002  0.077622       NaN
2015-01-01 23:41:13.000033 -1.688719       NaN
2015-01-01 23:41:14.000002       NaN  0.178439
2015-01-01 23:42:13.000033       NaN -0.911314
2015-01-01 23:50:21.000032 -0.750953       NaN
2015-01-01 23:56:21.000032       NaN  0.092930

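However, this is not what I want to achieve.

I want to merge df_2 into df_1 using only df_1's time series index, where each value in column 'B' is the df_2 value whose timestamp is closest to the corresponding df_1 index entry.

I previously achieved this using iterrows and relativedelta, as follows: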

for i, row in df_1.iterrows():
    df_2_temp = df_2.copy()
    df_2_temp['Timestamp'] = df_2_temp.index
    df_2_temp['Time Delta'] = abs(df_2_temp['Timestamp'] - row.name).apply(lambda x: x.total_seconds())  # .seconds would ignore whole days
    closest_value = df_2_temp.sort_values('Time Delta').iloc[0]['B']
    df_1.loc[row.name, 'B'] = closest_value

This works, but it is slow, and I have very large DataFrames to run it on.

Is there a faster solution? Perhaps something built into pandas?

【Comments】:

Tags: python pandas time-series


【Answer 1】:

This will probably be faster, even though apply is still a loop under the hood.

def find_idxmin(dt):
    return (df_2.index - dt).to_series().reset_index(drop=True).abs().idxmin()

df_1.apply(lambda row: df_2.iloc[find_idxmin(row.name)], axis=1)

I converted the DatetimeIndex differences to a Series so that I could apply abs and idxmin. I reset the index so that idxmin returns a row number that I can pass to iloc.
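
To make the chain inside find_idxmin easier to follow, here is a minimal sketch that breaks it into separate steps. The three-row df_2 and the dt value are made-up stand-ins for the question's data, and the intermediate variable names are only illustrative.

```python
import pandas as pd

# Made-up stand-in for df_2 from the question (the B values are arbitrary).
df_2 = pd.DataFrame(
    {"B": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(
        ["2015-01-01 18:30:01", "2015-01-01 21:00:02", "2015-01-01 21:31:03"]
    ),
)
dt = pd.Timestamp("2015-01-01 21:30:03")  # stands in for one df_1 index value

deltas = df_2.index - dt                       # TimedeltaIndex; entries may be negative
as_series = deltas.to_series()                 # Series, still labelled by the timedeltas
positional = as_series.reset_index(drop=True)  # labels become 0, 1, 2, ...
pos = positional.abs().idxmin()                # label of the smallest gap == row position

closest_b = df_2.iloc[pos]["B"]                # value of B at the nearest timestamp
print(pos, closest_b)
```

Without the reset_index step, idxmin would return a timedelta label rather than a position, and iloc could not use it.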


Edit: this seems to be as fast (5 ms) as the numpy-based answer linked in the comments:

def find_idxmin(dt):
    return np.argmin(np.abs(df_2.index.to_pydatetime() - dt))

By comparison, your solution runs in 30 ms (versus 5 ms here).
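
If df_2's index is unique and can be sorted, a different technique avoids the per-row apply altogether: Index.get_indexer with method='nearest' looks up all of df_1's timestamps in a single call. This is not what the answer above does, only a sketch of an alternative; the small frames below are made-up stand-ins for the question's data.

```python
import pandas as pd

# Made-up stand-ins for the question's frames.
df_1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2015-01-01 19:30:01", "2015-01-01 21:30:03", "2015-01-01 23:41:13"]
    ),
)
df_2 = pd.DataFrame(
    {"B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(
        [
            "2015-01-01 18:30:01",
            "2015-01-01 21:00:02",
            "2015-01-01 21:31:03",
            "2015-01-01 23:42:13",
        ]
    ),
)

# method="nearest" needs a monotonic, unique index to search in.
df_2_sorted = df_2.sort_index()

# One positional lookup for every df_1 timestamp at once.
pos = df_2_sorted.index.get_indexer(df_1.index, method="nearest")

result = df_1.copy()
result["B"] = df_2_sorted["B"].to_numpy()[pos]
print(result)
```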

【Comments】:

    【Answer 2】:

    Pandas now provides the functionality I believe you are looking for:

    pd.merge_asof(df1, df2, direction='nearest')
    

    See the merge_asof documentation.
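
As a rough sketch of how this might apply to the question's frames, which carry their timestamps in the index rather than in a column: merge_asof can join on the index via left_index=True and right_index=True, and both frames need to be sorted by that key. The data below is a made-up stand-in, and the 10-minute tolerance is an arbitrary choice that only illustrates the optional parameter.

```python
import pandas as pd

# Made-up stand-ins for df_1 and df_2 from the question.
df_1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2015-01-01 19:30:01", "2015-01-01 21:30:03", "2015-01-01 23:41:13"]
    ),
)
df_2 = pd.DataFrame(
    {"B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(
        [
            "2015-01-01 18:30:01",
            "2015-01-01 21:00:02",
            "2015-01-01 21:31:03",
            "2015-01-01 23:42:13",
        ]
    ),
)

# Keep df_1's index and pull in the nearest B from df_2.
merged = pd.merge_asof(
    df_1.sort_index(),
    df_2.sort_index(),
    left_index=True,
    right_index=True,
    direction="nearest",
)

# Optional: leave B as NaN when the nearest match is further away than 10 minutes.
limited = pd.merge_asof(
    df_1.sort_index(),
    df_2.sort_index(),
    left_index=True,
    right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("10min"),
)
print(merged)
print(limited)
```

The result keeps exactly the rows of the left frame, which is the shape asked for in the question.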

    Example: I have two devices. I have one DataFrame per device, each with a date column (dt in the output below) of type 'datetime64[ns, UTC]'.

    t_df[['dt', 'mode', 'state']]:
                                    dt  mode  state
    0 2020-09-23 22:10:36.508000+00:00     1      0
    1 2020-09-23 22:10:57.463000+00:00     1      0
    2 2020-09-23 22:11:18.815000+00:00     1      0
    3 2020-09-23 22:12:16.806000+00:00     1      0
    4 2020-09-23 22:12:22.512000+00:00     1      0
    5 2020-09-23 22:12:43.469000+00:00     1      0
    6 2020-09-23 22:13:04.776000+00:00     1      0
    7 2020-09-23 22:13:25.948000+00:00     1      0
    8 2020-09-23 22:13:47.223000+00:00     1      0
    
    v_df[['dt', 'temperature', 'pressure']]: 
                                    dt  temperature  pressure
    0 2020-09-23 22:12:04.204000+00:00        74.85   1004.50
    1 2020-09-23 22:12:18.203000+00:00        74.82   1004.67
    2 2020-09-23 22:12:30.358000+00:00        74.85   1004.71
    3 2020-09-23 22:12:44.601000+00:00        74.82   1004.46
    4 2020-09-23 22:12:59.158000+00:00        74.82   1004.67
    5 2020-09-23 22:13:10.443000+00:00        74.82   1004.67
    6 2020-09-23 22:13:24.577000+00:00        74.82   1004.67
    7 2020-09-23 22:13:37.544000+00:00        74.82   1004.67
    8 2020-09-23 22:13:50.106000+00:00        74.78   1004.63
    9 2020-09-23 22:14:03.377000+00:00        74.78   1004.42
    

    I used:

    new_df = pd.merge_asof(v_df[['dt', 'temperature', 'pressure']], t_df[['dt', 'mode', 'state']], direction='nearest')
    

    My result:

                                    dt  temperature  pressure  mode  state
    0 2020-09-23 22:12:04.204000+00:00        74.85   1004.50     1      0
    1 2020-09-23 22:12:18.203000+00:00        74.82   1004.67     1      0
    2 2020-09-23 22:12:30.358000+00:00        74.85   1004.71     1      0
    3 2020-09-23 22:12:44.601000+00:00        74.82   1004.46     1      0
    4 2020-09-23 22:12:59.158000+00:00        74.82   1004.67     1      0
    5 2020-09-23 22:13:10.443000+00:00        74.82   1004.67     1      0
    6 2020-09-23 22:13:24.577000+00:00        74.82   1004.67     1      0
    7 2020-09-23 22:13:37.544000+00:00        74.82   1004.67     1      0
    8 2020-09-23 22:13:50.106000+00:00        74.78   1004.63     1      0
    9 2020-09-23 22:14:03.377000+00:00        74.78   1004.42     1      0
    

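    ** This example shows only the last 10 rows of each DataFrame; their first rows are several minutes apart. Below are the last 10 rows after running on the full DataFrames (note: the 'date' and 'time' columns, taken from df1 and df2 respectively, were added in the merge for reference):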

    combo_df.iloc[-10:][['dt', 'date', 'time', 'pressure', 'temperature', 'mode', 'state']]
    
                                       dt                      date                      time  pressure  temperature  mode  state
    4440 2020-09-23 22:12:04.204000+00:00  2020-09-23T22:12:04.204Z  2020-09-23T22:12:16.806Z   1004.50        74.85     1      0
    4441 2020-09-23 22:12:18.203000+00:00  2020-09-23T22:12:18.203Z  2020-09-23T22:12:16.806Z   1004.67        74.82     1      0
    4442 2020-09-23 22:12:30.358000+00:00  2020-09-23T22:12:30.358Z  2020-09-23T22:12:22.512Z   1004.71        74.85     1      0
    4443 2020-09-23 22:12:44.601000+00:00  2020-09-23T22:12:44.601Z  2020-09-23T22:12:43.469Z   1004.46        74.82     1      0
    4444 2020-09-23 22:12:59.158000+00:00  2020-09-23T22:12:59.158Z  2020-09-23T22:13:04.776Z   1004.67        74.82     1      0
    4445 2020-09-23 22:13:10.443000+00:00  2020-09-23T22:13:10.443Z  2020-09-23T22:13:04.776Z   1004.67        74.82     1      0
    4446 2020-09-23 22:13:24.577000+00:00  2020-09-23T22:13:24.577Z  2020-09-23T22:13:25.948Z   1004.67        74.82     1      0
    4447 2020-09-23 22:13:37.544000+00:00  2020-09-23T22:13:37.544Z  2020-09-23T22:13:47.223Z   1004.67        74.82     1      0
    4448 2020-09-23 22:13:50.106000+00:00  2020-09-23T22:13:50.106Z  2020-09-23T22:13:47.223Z   1004.63        74.78     1      0
    4449 2020-09-23 22:14:03.377000+00:00  2020-09-23T22:14:03.377Z  2020-09-23T22:14:08.981Z   1004.42        74.78     1      0
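
One way to carry a reference timestamp through the merge, sketched here under assumptions: merge_asof keeps a single copy of the key column, so the right frame's timestamp is copied into a separate column before merging. The names matched_dt and gap are hypothetical (the output above uses string columns called date and time instead), and the two-row frames reuse a few values from the tables above.

```python
import pandas as pd

# Two rows from each device, taken from the tables above.
v_df = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2020-09-23 22:12:04.204", "2020-09-23 22:12:30.358"], utc=True
        ),
        "temperature": [74.85, 74.85],
    }
)
t_df = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2020-09-23 22:12:16.806", "2020-09-23 22:12:22.512"], utc=True
        ),
        "mode": [1, 1],
    }
)

# Copy the right-hand key so it survives the merge (hypothetical column name).
t_keyed = t_df.assign(matched_dt=t_df["dt"])

combo = pd.merge_asof(v_df, t_keyed, on="dt", direction="nearest")

# Distance between each row and the row it was matched to.
combo["gap"] = (combo["dt"] - combo["matched_dt"]).abs()
print(combo)
```

Inspecting gap is a quick way to spot matches that are technically nearest but too far apart to be meaningful.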
    

    【Comments】:
