【Posted】:2014-11-11 08:56:20
【Question】:
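I have the following dataframes, df1 and df2. I would like to obtain df3 by joining them as described below.
Both df1 and df2 contain timestamped events for specific machines.
In df3, I want every row of df1, plus, for each row, the timestamp of the df2 event for the same machine that is closest to, but earlier than, the timestamp of the df1 row. If there is no df2 event before the df1 event, the new value can be empty.
So this is a kind of merge, except that the link between the two tables is not only equality on "Machine" but also an inequality on the timestamp that should be minimized in one direction.
Here is the code that generates the sample dataframes: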
import pandas as pd
df1=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Status":["blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0"]})
df1["Date-time"]=pd.to_datetime(df1["Date-time"])
df2=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Date of maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-03"]})
df2["Date of maintenance"]=pd.to_datetime(df2["Date of maintenance"])
df3=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Status":["blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0"],"Date of last maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-20"]})
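To make the matching rule concrete, here is a minimal and deliberately slow sketch that applies it row by row to the sample data above. The helper name last_maintenance and the frame name df3_slow are made up for illustration, and "before" is read as strictly earlier, which is an assumption about the intended rule.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Machine": [0, 2, 3, 0, 2, 3],
    "Status": ["blah", "foo", "bar", "blah", "foo", "bar"],
    "Date-time": pd.to_datetime([
        "2014-02-20 11:00:19", "2014-02-21 12:29:55", "2014-02-20 11:00:21",
        "2014-02-19 09:10:19", "2014-02-18 12:19:47", "2014-02-20 01:33:00",
    ]),
})
df2 = pd.DataFrame({
    "Machine": [0, 2, 3, 0, 2, 3],
    "Date of maintenance": pd.to_datetime([
        "2014-02-20", "2014-02-21", "2014-02-20",
        "2014-02-10", "2014-02-07", "2014-02-03",
    ]),
})


def last_maintenance(row):
    """Latest df2 date for the same machine strictly before the row's timestamp (hypothetical helper)."""
    same_machine = df2["Machine"] == row["Machine"]
    earlier = df2["Date of maintenance"] < row["Date-time"]
    candidates = df2.loc[same_machine & earlier, "Date of maintenance"]
    # No earlier maintenance event for this machine: leave the value empty.
    return candidates.max() if not candidates.empty else pd.NaT


df3_slow = df1.copy()
df3_slow["Date of last maintenance"] = pd.to_datetime(
    df1.apply(last_maintenance, axis=1)
)
print(df3_slow)
```

This scans df2 once per row of df1, so it is only useful as a reference for checking a faster solution on small data.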
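Edit:
So I put together the following. It produces some duplicates, but I should be able to deal with those easily. The main missing piece is how to do the matching per machine rather than across the whole table.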
import pandas as pd
import numpy as np
df1=pd.DataFrame({"Machine":[0,2,3,0,2,3,0,1,0],"Status":["blah","foo","bar","blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0","2014-02-07 04:10:19.0","2014-02-19 11:11:47.0","2014-03-20 1:23:00.0"]})
df1["Date-time"]=pd.to_datetime(df1["Date-time"])
df1=df1.sort_values(["Date-time"])
df1=df1.reset_index(drop=True)
df2=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Date of maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-03"]})
df2["Date of maintenance"]=pd.to_datetime(df2["Date of maintenance"])
df2=df2.sort_values(["Date of maintenance"])
df2=df2.reset_index(drop=True)
df2["searchsortindex"]=np.searchsorted(np.array(df1["Date-time"]), np.array(df2["Date of maintenance"]), side='left', sorter=None)
df3=pd.merge(df1,df2,how='left',left_index=True,right_on='searchsortindex')
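For the per-machine matching that the attempt above is missing, one possible sketch uses pandas.merge_asof, which was added in pandas 0.19 and so is newer than this question. It is not the approach taken above. The by="Machine" argument restricts matches to the same machine, and direction="backward" picks the latest earlier maintenance date. Setting allow_exact_matches=False treats "before" as strictly earlier, which is an assumption about the intended rule.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Machine": [0, 2, 3, 0, 2, 3, 0, 1, 0],
    "Status": ["blah", "foo", "bar", "blah", "foo", "bar", "blah", "foo", "bar"],
    "Date-time": pd.to_datetime([
        "2014-02-20 11:00:19", "2014-02-21 12:29:55", "2014-02-20 11:00:21",
        "2014-02-19 09:10:19", "2014-02-18 12:19:47", "2014-02-20 01:33:00",
        "2014-02-07 04:10:19", "2014-02-19 11:11:47", "2014-03-20 01:23:00",
    ]).astype("datetime64[ns]"),
})
df2 = pd.DataFrame({
    "Machine": [0, 2, 3, 0, 2, 3],
    "Date of maintenance": pd.to_datetime([
        "2014-02-20", "2014-02-21", "2014-02-20",
        "2014-02-10", "2014-02-07", "2014-02-03",
    ]).astype("datetime64[ns]"),  # merge_asof expects both time keys to share a dtype
})

# merge_asof requires both frames to be sorted by their time key.
left = df1.sort_values("Date-time")
right = df2.sort_values("Date of maintenance")

df3 = pd.merge_asof(
    left,
    right,
    left_on="Date-time",
    right_on="Date of maintenance",
    by="Machine",               # only match events of the same machine
    direction="backward",       # latest maintenance at or before the event...
    allow_exact_matches=False,  # ...and exclude equal timestamps (assumption)
).rename(columns={"Date of maintenance": "Date of last maintenance"})
print(df3)
```

Rows of df1 with no earlier maintenance event for their machine (including machine 1, which never appears in df2) end up with NaT in the new column. The result is ordered by "Date-time" rather than in the original order of df1.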
【Discussion】: