Most Efficient Way to Find Closest Date Between 2 Dataframes: Answers

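【Question title】: Most Efficient Way to Find Closest Date Between 2 Dataframes
【Posted】: 2019-12-23 06:23:43
【Question description】:

I have an hourly weather dataset that I imported into a pandas dataframe. In this dataframe I have 2 columns, as follows (in addition to other columns):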

import pandas as pd

wd = pd.read_csv('hourlyweather.csv')  # wd is short for Weather Data
wd['Date and Time'] = wd['Date and Time'].astype('datetime64[ns]')
# explicit 64-bit integer: nanoseconds since the epoch
wd['Date and Time (int)'] = wd['Date and Time'].astype('int64')
wd['Temperature Celsius'] = wd['Temperature Celsius'].astype('double')

I also have another dataset (of hourly car accidents) with different data but similar columns, as follows:

cd = pd.read_csv('accidents.csv')  # cd is short for Crime Data
cd['Occurred Date Time'] = cd['Occurred Date Time'].astype('datetime64[ns]')
cd['Occurred Date Time (int)'] = cd['Occurred Date Time'].astype('int64')
cd.insert(6, "Temp in Celsius", " ")  # placeholder column, filled in by the loop below

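My goal is to find the temperature at the time of each car accident. Since I don't have the exact hourly temperature, I want to find the closest temperature reading in the weather dataset for each accident. So, for each accident, I want to find the closest date and time in the weather dataset, then take the temperature at that date and time and insert it into the corresponding column of the accident dataframe.

I tried doing it with a FOR loop (it works fine), but it takes a very long time to process, because I have more than one million car accidents. Below is my FOR loop: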

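for i in range(len(cd['Occurred Date Time (int)'])):
    sourceint = cd['Occurred Date Time (int)'][i]
    # label of the weather row whose timestamp is closest to this accident
    idx = wd['Date and Time (int)'].sub(sourceint).abs().idxmin()
    # .loc avoids chained assignment, which may write to a temporary copy
    cd.loc[i, "Temp in Celsius"] = wd.loc[idx, 'Temperature Celsius']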

Is there a more efficient way to do this without a FOR loop, so that this many records can be processed faster?

Here are some demo samples of the CSV files above

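【Question comments】:

Are your CSVs already sorted by 'Date and Time'?
@Gepapado Yes, the weather (wd) CSV is already sorted.
@Gepapado The accidents CSV is sorted by day, not by hour.
A full left join of the two datasets on 'day' would work well. Then find the minimum difference between the two times for each row, and drop the duplicates.
If you provide demo dataframes for both datasets, I can help you more.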
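
The day-level join suggested in the comments above could be sketched roughly as follows. This is only an illustration on made-up data: the column names come from the question, while `accident_id`, `day`, and `gap` are hypothetical helper columns introduced here.

```python
import pandas as pd

# Made-up stand-ins for the question's two dataframes.
wd = pd.DataFrame({
    "Date and Time": pd.to_datetime(
        ["2019-01-01 10:00", "2019-01-01 11:00", "2019-01-02 00:00"]),
    "Temperature Celsius": [1.0, 2.0, 5.0],
})
cd = pd.DataFrame({
    "Occurred Date Time": pd.to_datetime(
        ["2019-01-01 10:50", "2019-01-01 10:10"]),
})

# Hypothetical row id so each accident can be reduced to one row later.
cd = cd.reset_index().rename(columns={"index": "accident_id"})
cd["day"] = cd["Occurred Date Time"].dt.normalize()
wd["day"] = wd["Date and Time"].dt.normalize()

# Pair every accident with every weather reading of the same day.
joined = cd.merge(wd, on="day", how="left")
joined["gap"] = (joined["Occurred Date Time"] - joined["Date and Time"]).abs()

# Keep the pairing with the smallest time gap for each accident.
closest = (joined.sort_values("gap")
                 .drop_duplicates(subset="accident_id")
                 .sort_values("accident_id"))
```

Two caveats with this sketch: only readings from the same calendar day are considered, so an accident at 23:50 never matches the reading at 00:00 the next day, and the intermediate join holds roughly 24 rows per accident with hourly weather data.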

Tags: python pandas dataframe datetime

【Solution 1】:

You can use pd.merge_asof to merge the two dataframes. You need to sort both the left and right dataframes and drop duplicates.

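cd['Occurred Date Time'] = pd.to_datetime(cd['Occurred Date Time'])
wd['Date and Time'] = pd.to_datetime(wd['Date and Time'])

# merge_asof requires both frames to be sorted by their key
wd.drop_duplicates(subset=['Date and Time'], inplace=True)
wd.sort_values(by=['Date and Time'], inplace=True)
# note: this also drops accidents that share a timestamp with an earlier row
cd.drop_duplicates(subset=['Occurred Date Time'], inplace=True)
cd.sort_values(by=['Occurred Date Time'], inplace=True)

# direction='nearest' picks the closest weather row;
# the default ('backward') only considers rows at or before the accident time
df = pd.merge_asof(cd, wd, left_on='Occurred Date Time', right_on='Date and Time',
                   direction='nearest')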
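
To make the behaviour of pd.merge_asof concrete, here is a minimal sketch on made-up data. The column names are taken from the question; the values, the one-hour tolerance, and the variable names `nearest` and `backward` are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-ins for the question's two dataframes.
wd = pd.DataFrame({
    "Date and Time": pd.to_datetime(
        ["2019-01-01 10:00", "2019-01-01 11:00", "2019-01-01 12:00"]),
    "Temperature Celsius": [1.0, 2.0, 3.0],
})
cd = pd.DataFrame({
    "Occurred Date Time": pd.to_datetime(
        ["2019-01-01 11:40", "2019-01-01 10:10", "2019-01-01 11:40"]),
})

# Both sides must be sorted by their key; repeated accident times are allowed.
cd_sorted = cd.sort_values("Occurred Date Time")
wd_sorted = wd.sort_values("Date and Time")

# Closest reading in either direction, ignoring readings more than 1 hour away.
nearest = pd.merge_asof(
    cd_sorted, wd_sorted,
    left_on="Occurred Date Time", right_on="Date and Time",
    direction="nearest",
    tolerance=pd.Timedelta("1h"),
)

# Default direction ("backward"): last reading at or before the accident.
backward = pd.merge_asof(
    cd_sorted, wd_sorted,
    left_on="Occurred Date Time", right_on="Date and Time",
)
```

In this toy data the 11:40 accidents get the 12:00 reading with direction="nearest" but the 11:00 reading with the default, which is why the direction matters for a "closest date" lookup.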

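【Comments】:

The merge gives me an error: ValueError: Merge keys contain null values on left side
I tried to drop the null values, but there are no nulls in the left dataframe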
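
pd.merge_asof raises this error when the left key column contains missing values, which for datetimes means NaT. One possible cause, offered here only as an assumption about the commenter's data, is that some timestamps fail to parse or are empty, so the nulls sit in the key column rather than where the check was made. A minimal sketch for locating and removing them, using made-up rows and a hypothetical `id` column:

```python
import pandas as pd

# Made-up accident rows: one valid timestamp, one unparseable, one missing.
cd = pd.DataFrame({
    "Occurred Date Time": ["2019-01-01 10:10", "not a date", None],
    "id": [1, 2, 3],
})

# errors="coerce" turns unparseable entries into NaT instead of raising.
cd["Occurred Date Time"] = pd.to_datetime(cd["Occurred Date Time"], errors="coerce")

# Count the null merge keys, then drop only the rows where the key is null.
n_missing = cd["Occurred Date Time"].isna().sum()
cd_clean = cd.dropna(subset=["Occurred Date Time"]).sort_values("Occurred Date Time")
```

Passing `cd_clean` as the left frame should avoid the null-key error; whether dropping those accidents is acceptable depends on the analysis.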

【Solution 2】:

Let me know if the code is not fully clear



# Assumes both frames have a 'datetime' column and df_weather has a 'temp' column.
df_accident['datetime'] = df_accident['datetime'].apply(lambda x: pd.Timestamp(x))
df_accident['year'] = df_accident['datetime'].apply(lambda x: x.year)
df_accident['month'] = df_accident['datetime'].apply(lambda x: x.month)
df_accident['day'] = df_accident['datetime'].apply(lambda x: x.day)
df_accident['hour'] = df_accident['datetime'].apply(lambda x: x.hour)
df_accident['minute'] = df_accident['datetime'].apply(lambda x: x.minute)

df_weather['datetime'] = df_weather['datetime'].apply(lambda x: pd.Timestamp(x))
df_weather['year'] = df_weather['datetime'].apply(lambda x: x.year)
df_weather['month'] = df_weather['datetime'].apply(lambda x: x.month)
df_weather['day'] = df_weather['datetime'].apply(lambda x: x.day)
df_weather['hour'] = df_weather['datetime'].apply(lambda x: x.hour)
df_weather['minute'] = df_weather['datetime'].apply(lambda x: x.minute)

columns = ['year', 'month', 'day', 'hour', 'minute']
joint_dfs_array = []
remaining = df_accident  # accidents that have no weather match yet
for i in range(5):
    cols = columns[:5-i]  # all five fields first, then without minute, then without hour, ...
    # keep one weather row per key so the left join cannot duplicate accidents
    weather = df_weather.drop_duplicates(subset=cols)[cols + ['temp']]
    joint_df = remaining.merge(weather, on=cols, how='left')
    matched = joint_df['temp'].notna()
    joint_dfs_array.append(joint_df[matched])
    # only the unmatched accidents go on to the next, coarser join
    remaining = joint_df[~matched].drop(columns='temp')

# stack the matches found at every level of precision
final_df = pd.concat(joint_dfs_array, axis=0, ignore_index=True)

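final_df is the answer.

【Comments】:

Hi @Parijat. Since you are using lambda functions, I can't follow your answer. When I run it on the demo dataset, it gives me a lot of errors.
This is only meant to point in a direction. The logic is: a) convert the dates to timestamps, then extract the day, month, year, hour, and minute for both dataframes. If an event matches exactly, that is, down to the smallest time unit (the minute), it already has its closest timestamp in the weather dataset. If not, it may match down to the hour. If it does not even match there, it may match down to the day. So you first do a left outer join on year, month, day, hour, and minute. Take the rows for which you found an answer and filter them out. Repeat the process until you have all the answers.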
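
One limitation of the cascade described above is that it matches on truncated calendar fields, so an accident at 10:59 falls back to the 10:00 reading even though 11:00 is closer. As an alternative sketch (a different technique from this answer, swapped in here for comparison), a sorted-array lookup with numpy.searchsorted can pick the truly closest reading without a Python loop. The data is made up, the column names are those of the question, and the sketch assumes at least two weather rows.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's two dataframes.
wd = pd.DataFrame({
    "Date and Time": pd.to_datetime(
        ["2019-01-01 10:00", "2019-01-01 11:00", "2019-01-01 12:00"]),
    "Temperature Celsius": [1.0, 2.0, 3.0],
})
cd = pd.DataFrame({
    "Occurred Date Time": pd.to_datetime(
        ["2019-01-01 10:59", "2019-01-01 09:00", "2019-01-01 13:30"]),
})

# The lookup needs the weather timestamps in ascending order.
wd = wd.sort_values("Date and Time").reset_index(drop=True)
w_times = wd["Date and Time"].to_numpy()
a_times = cd["Occurred Date Time"].to_numpy()

# Position of the first weather reading at or after each accident,
# clipped so that both that position and the one before it are valid.
right = np.clip(np.searchsorted(w_times, a_times), 1, len(w_times) - 1)
left = right - 1

# Choose whichever neighbour is closer in time (ties go to the earlier one).
use_left = (a_times - w_times[left]) <= (w_times[right] - a_times)
idx = np.where(use_left, left, right)

cd["Temp in Celsius"] = wd["Temperature Celsius"].to_numpy()[idx]
```

Here the 10:59 accident receives the 11:00 reading, and accidents before the first or after the last reading receive the first or last reading respectively. The accident dataframe does not need to be sorted for this lookup.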