Assuming your data can be reproduced like this (a note for future questions: providing dummy data like this helps people answer you faster).
import pandas as pd

df_trips = pd.DataFrame([
['2021-08-24 15:50:27.063000+00:00','2021-08-24 16:54:54+00:00', "B8", 1],
['2021-08-28 15:50:27.063000+00:00','2021-08-30 16:54:54+00:00', "B8", 2],
['2021-08-24 16:50:27.063000+00:00','2021-08-24 16:54:54+00:00', "A7", 3],
], columns=['start', 'stop', 'device', 'trip_id'])
df_trips['start'] = pd.to_datetime(df_trips['start'])
df_trips['stop'] = pd.to_datetime(df_trips['stop'])
print(df_trips)
start stop device trip_id
0 2021-08-24 15:50:27.063000+00:00 2021-08-24 16:54:54+00:00 B8 1
1 2021-08-28 15:50:27.063000+00:00 2021-08-30 16:54:54+00:00 B8 2
2 2021-08-24 16:50:27.063000+00:00 2021-08-24 16:54:54+00:00 A7 3
and
df_points = pd.DataFrame([
['2021-08-24 15:52:27.063000+00:00',"B8", 1],
['2021-08-25 15:50:27.063000+00:00',"B8", 2],
['2021-08-28 16:50:27.063000+00:00',"B8", 3],
['2021-08-29 15:50:27.063000+00:00',"B8", 4],
['2021-08-24 16:51:27.063000+00:00',"A7", 5],
], columns=['dateTime', 'device', 'point_id'])
df_points['dateTime'] = pd.to_datetime(df_points['dateTime'])
print(df_points)
dateTime device point_id
0 2021-08-24 15:52:27.063000+00:00 B8 1 # in trip 1
1 2021-08-25 15:50:27.063000+00:00 B8 2 # no trip
2 2021-08-28 16:50:27.063000+00:00 B8 3 # trip 2
3 2021-08-29 15:50:27.063000+00:00 B8 4 # trip 2
4 2021-08-24 16:51:27.063000+00:00 A7 5 # trip 3, overlap time other device
To do what you want, first use merge_asof by device, looking backward from each dateTime to the most recent start for the same device.
points = pd.merge_asof(
# both DataFrames must be sorted on their merge key
df_points.sort_values('dateTime'),
df_trips.sort_values('start'),
# first match rows by device
by='device',
# merge_asof on dateTime and start
left_on='dateTime',
right_on='start',
# look for start before dateTime
direction='backward'
)
print(points.sort_values('point_id'))
dateTime device point_id \
0 2021-08-24 15:52:27.063000+00:00 B8 1
2 2021-08-25 15:50:27.063000+00:00 B8 2
3 2021-08-28 16:50:27.063000+00:00 B8 3
4 2021-08-29 15:50:27.063000+00:00 B8 4
1 2021-08-24 16:51:27.063000+00:00 A7 5
start stop trip_id
0 2021-08-24 15:50:27.063000+00:00 2021-08-24 16:54:54+00:00 1
2 2021-08-24 15:50:27.063000+00:00 2021-08-24 16:54:54+00:00 1
3 2021-08-28 15:50:27.063000+00:00 2021-08-30 16:54:54+00:00 2
4 2021-08-28 15:50:27.063000+00:00 2021-08-30 16:54:54+00:00 2
1 2021-08-24 16:50:27.063000+00:00 2021-08-24 16:54:54+00:00 3
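This is almost right, but you can see that in the second row (point_id 2) dateTime is not between start and stop. So wherever dateTime is later than stop, no trip was actually found: you can either replace trip_id with pd.NA or None, or drop those rows.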
points.loc[points['dateTime']>points['stop'], 'trip_id'] = pd.NA
points = points[list(df_points.columns)+['trip_id']]
# or remove the rows without trip_id
#points = points.loc[points['dateTime']<=points['stop'],
# list(df_points.columns)+['trip_id']]
print(points)
dateTime device point_id trip_id
0 2021-08-24 15:52:27.063000+00:00 B8 1 1
1 2021-08-24 16:51:27.063000+00:00 A7 5 3
2 2021-08-25 15:50:27.063000+00:00 B8 2 <NA>
3 2021-08-28 16:50:27.063000+00:00 B8 3 2
4 2021-08-29 15:50:27.063000+00:00 B8 4 2
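The two steps above can be wrapped into one helper. This is only a minimal sketch: `assign_trip_ids` and its `drop_unmatched` flag are hypothetical names introduced here, not part of pandas. It assumes trips for the same device do not overlap (merge_asof only considers the most recent start), and it converts trip_id to the nullable `Int64` dtype so the ids stay integers next to missing values.

```python
import pandas as pd


def assign_trip_ids(df_points, df_trips, drop_unmatched=False):
    """Attach the trip_id of the trip (same device) whose [start, stop]
    interval contains each point's dateTime. Hypothetical helper."""
    merged = pd.merge_asof(
        df_points.sort_values('dateTime'),
        df_trips.sort_values('start'),
        by='device',
        left_on='dateTime',
        right_on='start',
        direction='backward',
    )
    # A point is inside the trip only if it is not later than the trip's stop.
    # Points with no earlier start get stop = NaT, which compares as False.
    inside = merged['dateTime'] <= merged['stop']
    # Nullable integer dtype; rows outside the interval become <NA>.
    merged['trip_id'] = merged['trip_id'].astype('Int64').where(inside)
    if drop_unmatched:
        merged = merged[inside]
    return merged[list(df_points.columns) + ['trip_id']].reset_index(drop=True)


# Usage with the same dummy data as above.
df_trips = pd.DataFrame([
    ['2021-08-24 15:50:27.063000+00:00', '2021-08-24 16:54:54+00:00', "B8", 1],
    ['2021-08-28 15:50:27.063000+00:00', '2021-08-30 16:54:54+00:00', "B8", 2],
    ['2021-08-24 16:50:27.063000+00:00', '2021-08-24 16:54:54+00:00', "A7", 3],
], columns=['start', 'stop', 'device', 'trip_id'])
df_trips['start'] = pd.to_datetime(df_trips['start'])
df_trips['stop'] = pd.to_datetime(df_trips['stop'])

df_points = pd.DataFrame([
    ['2021-08-24 15:52:27.063000+00:00', "B8", 1],
    ['2021-08-25 15:50:27.063000+00:00', "B8", 2],
    ['2021-08-28 16:50:27.063000+00:00', "B8", 3],
    ['2021-08-29 15:50:27.063000+00:00', "B8", 4],
    ['2021-08-24 16:51:27.063000+00:00', "A7", 5],
], columns=['dateTime', 'device', 'point_id'])
df_points['dateTime'] = pd.to_datetime(df_points['dateTime'])

result = assign_trip_ids(df_points, df_trips)
matched_only = assign_trip_ids(df_points, df_trips, drop_unmatched=True)
print(result)
```

Passing `drop_unmatched=True` corresponds to the commented-out alternative above: points that fall outside every trip are removed instead of being kept with a missing trip_id.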