【Posted】:2016-08-24 08:13:25
【Question】:
I have two DataFrames, which can be represented by the following MWE:
import pandas as pd
from datetime import datetime
import numpy as np
df_1 = pd.DataFrame(np.random.randn(9), columns = ['A'], index= [
datetime(2015,1,1,19,30,1,20),
datetime(2015,1,1,20,30,2,12),
datetime(2015,1,1,21,30,3,50),
datetime(2015,1,1,22,30,5,43),
datetime(2015,1,1,22,30,52,11),
datetime(2015,1,1,23,30,54,8),
datetime(2015,1,1,23,40,14,2),
datetime(2015,1,1,23,41,13,33),
datetime(2015,1,1,23,50,21,32),
])
df_2 = pd.DataFrame(np.random.randn(9), columns = ['B'], index= [
datetime(2015,1,1,18,30,1,20),
datetime(2015,1,1,21,0,2,12),
datetime(2015,1,1,21,31,3,50),
datetime(2015,1,1,22,34,5,43),
datetime(2015,1,1,22,35,52,11),
datetime(2015,1,1,23,0,54,8),
datetime(2015,1,1,23,41,14,2),
datetime(2015,1,1,23,42,13,33),
datetime(2015,1,1,23,56,21,32),
])
I want to merge the two DataFrames into one, and I know I can do that with the following code:
In [21]: df_1.join(df_2, how='outer')
Out[21]:
A B
2015-01-01 18:30:01.000020 NaN -1.411907
2015-01-01 19:30:01.000020 0.109913 NaN
2015-01-01 20:30:02.000012 -0.440529 NaN
2015-01-01 21:00:02.000012 NaN -1.277403
2015-01-01 21:30:03.000050 -0.194020 NaN
2015-01-01 21:31:03.000050 NaN -0.042259
2015-01-01 22:30:05.000043 1.445220 NaN
2015-01-01 22:30:52.000011 -0.341176 NaN
2015-01-01 22:34:05.000043 NaN 0.905912
2015-01-01 22:35:52.000011 NaN -0.167559
2015-01-01 23:00:54.000008 NaN 1.289961
2015-01-01 23:30:54.000008 -0.929973 NaN
2015-01-01 23:40:14.000002 0.077622 NaN
2015-01-01 23:41:13.000033 -1.688719 NaN
2015-01-01 23:41:14.000002 NaN 0.178439
2015-01-01 23:42:13.000033 NaN -0.911314
2015-01-01 23:50:21.000032 -0.750953 NaN
2015-01-01 23:56:21.000032 NaN 0.092930
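This is not what I want to achieve.
I want to merge df_2 into df_1 on df_1's time-series index only, so that each value in column 'B' is the one from df_2 whose timestamp is closest to the corresponding index entry in df_1.
I previously achieved this using iterrows and relativedelta, as follows: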
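for i, row in df_1.iterrows():
    # Work on a copy so df_2 itself is not modified
    df_2_temp = df_2.copy()
    df_2_temp['Timestamp'] = df_2_temp.index
    # Absolute time difference (in seconds) between each df_2 timestamp and this df_1 row
    df_2_temp['Time Delta'] = abs(df_2_temp['Timestamp'] - row.name).dt.total_seconds()
    # Take 'B' from the df_2 row with the smallest difference
    closest_value = df_2_temp.sort_values('Time Delta').iloc[0]['B']
    df_1.loc[row.name, 'B'] = closest_value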
This works, but it is slow, and I need to do this on very large DataFrames.
Is there a faster solution? Perhaps something built into pandas?
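For reference, one built-in candidate is `DataFrame.reindex` with `method='nearest'`. The following is only a minimal sketch, not a benchmarked solution. It reuses the timestamps from the MWE but replaces the random values with fixed ones (an assumption made here so the result is easy to verify by hand), and it assumes the index of df_2 has no duplicates.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Same timestamps as the MWE; values are fixed (0..8) instead of random
# so the nearest-match result can be checked by eye.
df_1 = pd.DataFrame({'A': np.arange(9.0)}, index=[
    datetime(2015, 1, 1, 19, 30, 1, 20),
    datetime(2015, 1, 1, 20, 30, 2, 12),
    datetime(2015, 1, 1, 21, 30, 3, 50),
    datetime(2015, 1, 1, 22, 30, 5, 43),
    datetime(2015, 1, 1, 22, 30, 52, 11),
    datetime(2015, 1, 1, 23, 30, 54, 8),
    datetime(2015, 1, 1, 23, 40, 14, 2),
    datetime(2015, 1, 1, 23, 41, 13, 33),
    datetime(2015, 1, 1, 23, 50, 21, 32),
])
df_2 = pd.DataFrame({'B': np.arange(9.0)}, index=[
    datetime(2015, 1, 1, 18, 30, 1, 20),
    datetime(2015, 1, 1, 21, 0, 2, 12),
    datetime(2015, 1, 1, 21, 31, 3, 50),
    datetime(2015, 1, 1, 22, 34, 5, 43),
    datetime(2015, 1, 1, 22, 35, 52, 11),
    datetime(2015, 1, 1, 23, 0, 54, 8),
    datetime(2015, 1, 1, 23, 41, 14, 2),
    datetime(2015, 1, 1, 23, 42, 13, 33),
    datetime(2015, 1, 1, 23, 56, 21, 32),
])

# reindex with a fill method needs a sorted (monotonic) index on df_2.
# For every timestamp in df_1, pick the df_2 row whose timestamp is closest.
nearest_b = df_2.sort_index().reindex(df_1.index, method='nearest')

# nearest_b now carries df_1's index, so a plain join aligns row by row.
result = df_1.join(nearest_b)
print(result)
```

With these fixed values, column 'B' of `result` holds the position of the closest df_2 row for each df_1 timestamp. The row-by-row loop is replaced by a single vectorised lookup; how much faster that is on large frames would need to be timed on real data.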
【Comments】:
-
Is this of any help: stackoverflow.com/questions/15115547/… ?
-
EdChum is right. See stackoverflow.com/questions/15115547/…
-
I'm still not quite sure how to apply the answers there to my MWE. Could one of you give an example?
-
For some reason I can't run your loop. I'd love to see a timing comparison with the answer I suggested.
Tags: python pandas time-series