How to insert new rows in pandas based on an hour-difference criterion

【Question title】: How to insert new line in pandas on hour differences criteria
【Posted】: 2021-04-30 06:26:10
【Question】:

I have the following DataFrame:

  Matricule Startdate   Starthour   Enddate     Endhour
0   5357    2019-01-08  14:21:06    2019-01-08  14:34:42
1   5357    2019-01-08  15:29:23    2019-01-08  15:33:43
2   5357    2019-01-08  19:51:11    2019-01-08  20:02:48
3   5357    2019-03-08  20:05:49    2019-03-08  21:04:52
4   aaaa    2019-01-08  14:17:51    2019-01-08  14:32:10
5   aaaa    2019-01-08  18:21:16    2019-01-08  18:39:26

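I am trying to build a table in which a new row is inserted between two consecutive rows whenever the gap between the end time (Endhour) of the first row and the start time (Starthour) of the second row is greater than 30 minutes. The inserted row has the same attributes as the preceding row. Here is an example: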

     Matricule  Startdate   Starthour   Enddate     Endhour
    0   5357    2019-01-08  14:21:06    2019-01-08  14:34:42
    1   5357    2019-01-08  14:34:42    2019-01-08  15:04:42
    2   5357    2019-01-08  15:29:23    2019-01-08  15:33:43
    3   5357    2019-01-08  15:33:43    2019-01-08  16:03:43
    4   5357    2019-01-08  19:51:11    2019-01-08  20:02:48
    5   5357    2019-03-08  20:05:49    2019-03-08  21:04:52
    6   aaaa    2019-01-08  14:17:51    2019-01-08  14:32:10
    7   aaaa    2019-01-08  14:32:10    2019-01-08  15:02:10
    8   aaaa    2019-01-08  18:21:16    2019-01-08  18:39:26

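【Question comments】:

So you always want to insert a row, and the inserted row's times should be the midpoint of the adjacent rows' times?
If there are not 30 minutes between the start time of row n and the end time of row n-1, I do not want to add a row. But if there are, I want to add a copy of row n-1, with different times from the ones you describe.

Tags: pandas datetime duplicates conditional-statements criteria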

【Solution 1】:

First, I create new columns that combine the date and the time into a single datetime object:

import numpy as np
import pandas as pd

# Combine the date and time columns into single datetime columns
df['start'] = pd.to_datetime(df['Startdate'].astype(str) + " " + df['Starthour'].astype(str))
df['end'] = pd.to_datetime(df['Enddate'].astype(str) + " " + df['Endhour'].astype(str))

Next, compute the gap to the next record, making sure the data is sorted first:

df = df.sort_values(['Matricule','start'])
df['gap_to_next'] = (df['start'].shift(-1) - df['end'])

Handle the boundaries between different Matricule values, where the next row belongs to someone else:

cut = df['Matricule'] != df['Matricule'].shift(-1)
df.loc[cut, 'gap_to_next'] = np.nan
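
As an alternative to masking the group boundaries by hand, a grouped shift can compute the gap per Matricule in one step. The following is a minimal sketch, assuming combined `start` and `end` datetime columns like the ones built above; the sample values are a subset copied from the question.

```python
import pandas as pd

# Subset of the question's data, already combined into datetime columns.
df = pd.DataFrame({
    "Matricule": ["5357", "5357", "aaaa", "aaaa"],
    "start": pd.to_datetime(["2019-01-08 14:21:06", "2019-01-08 15:29:23",
                             "2019-01-08 14:17:51", "2019-01-08 18:21:16"]),
    "end": pd.to_datetime(["2019-01-08 14:34:42", "2019-01-08 15:33:43",
                           "2019-01-08 14:32:10", "2019-01-08 18:39:26"]),
})

df = df.sort_values(["Matricule", "start"])

# Shifting inside each Matricule group leaves NaT on the last row of a group,
# so it never picks up the first start time of the next group.
next_start = df.groupby("Matricule")["start"].shift(-1)
df["gap_to_next"] = next_start - df["end"]

print(df[["Matricule", "gap_to_next"]])
```

With this variant the separate `cut` step is not needed, because the gap at each group boundary is already missing.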

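Define a boolean Series marking where a new row needs to be inserted. I used your threshold of 30 minutes, but also added a check that the gap is less than 1 day, because your sample contains a case that seems to imply this. Adjust as needed: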

should_insert_next = ( (df['gap_to_next'] > pd.Timedelta(30, 'min')) & (df['gap_to_next'] < pd.Timedelta(24, 'hr')) )
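
To see how the two conditions interact, here is a small worked sketch. The gap values are worked out by hand from the Matricule 5357 rows of the question, and the last entry stands for a row with no successor; the variable name `gaps` is only illustrative.

```python
import pandas as pd

# Gaps after each of the four 5357 rows; None becomes NaT (no next row).
gaps = pd.Series(pd.to_timedelta([
    "0 days 00:54:41",   # 14:34:42 -> 15:29:23
    "0 days 04:17:28",   # 15:33:43 -> 19:51:11
    "59 days 00:03:01",  # 2019-01-08 20:02:48 -> 2019-03-08 20:05:49
    None,
]))

# Comparisons against NaT evaluate to False, so rows without a
# successor are never selected.
should_insert_next = (gaps > pd.Timedelta(minutes=30)) & (gaps < pd.Timedelta(hours=24))

print(should_insert_next.tolist())
```

Only the first two gaps fall strictly between 30 minutes and 24 hours, which matches the two rows inserted for 5357 in the desired output.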

Copy only those rows:

new_rows = df[should_insert_next].copy()

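Using these rows as templates, adjust the times to the ones you want for the inserted rows. It looks like you want each new record to start where the previous one ended and to last 30 minutes: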

new_rows['start'] = new_rows['end']
new_rows['end'] = new_rows['start'] + pd.Timedelta(30, 'min')

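Then write the new times back to the original date and hour columns. If those columns are not strings in your data, you can add a further step after this one to convert them back to whatever type they were: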

new_rows['Startdate'] = new_rows['start'].dt.strftime("%Y-%m-%d")
new_rows['Enddate'] = new_rows['end'].dt.strftime("%Y-%m-%d")
new_rows['Starthour'] = new_rows['start'].dt.strftime("%H:%M:%S")
new_rows['Endhour'] = new_rows['end'].dt.strftime("%H:%M:%S")
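
If the original columns hold Python `date` and `time` objects rather than strings, one possible conversion uses the `.dt.date` and `.dt.time` accessors. This is only a sketch under that assumption; the one-row `new_rows` frame below is made up for illustration.

```python
import pandas as pd

# Illustrative stand-in for the template rows built in the answer.
new_rows = pd.DataFrame({
    "start": pd.to_datetime(["2019-01-08 14:34:42"]),
    "end": pd.to_datetime(["2019-01-08 15:04:42"]),
})

# .dt.date and .dt.time return datetime.date / datetime.time objects
# (stored in object-dtype columns) instead of formatted strings.
new_rows["Startdate"] = new_rows["start"].dt.date
new_rows["Starthour"] = new_rows["start"].dt.time
new_rows["Enddate"] = new_rows["end"].dt.date
new_rows["Endhour"] = new_rows["end"].dt.time

print(new_rows[["Startdate", "Starthour", "Enddate", "Endhour"]])
```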

Finally, concatenate the old and new rows and re-sort:

final = pd.concat([df, new_rows])
final = final.sort_values(['Matricule','start'])
final = final.drop(columns=['gap_to_next','start','end'])
final = final.reset_index(drop=True)

This gives:

print(final)
  Matricule   Startdate Starthour     Enddate   Endhour
0      5357  2019-01-08  14:21:06  2019-01-08  14:34:42
1      5357  2019-01-08  14:34:42  2019-01-08  15:04:42
2      5357  2019-01-08  15:29:23  2019-01-08  15:33:43
3      5357  2019-01-08  15:33:43  2019-01-08  16:03:43
4      5357  2019-01-08  19:51:11  2019-01-08  20:02:48
5      5357  2019-03-08  20:05:49  2019-03-08  21:04:52
6      aaaa  2019-01-08  14:17:51  2019-01-08  14:32:10
7      aaaa  2019-01-08  14:32:10  2019-01-08  15:02:10
8      aaaa  2019-01-08  18:21:16  2019-01-08  18:39:26
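
For convenience, the steps of this answer can be gathered into a single function. The sketch below is one way to do that, not the answer's original code: the function name `insert_gap_rows` and its parameters are hypothetical, it assumes the date and hour columns are strings, and it uses a grouped shift instead of the separate boundary mask.

```python
import pandas as pd


def insert_gap_rows(df, min_gap="30min", max_gap="24h", fill="30min"):
    """Hypothetical helper: after each row whose gap to the next row of the
    same Matricule is strictly between min_gap and max_gap, insert a row
    that starts at that row's end and lasts `fill`."""
    out = df.copy()
    out["start"] = pd.to_datetime(out["Startdate"].astype(str) + " " + out["Starthour"].astype(str))
    out["end"] = pd.to_datetime(out["Enddate"].astype(str) + " " + out["Endhour"].astype(str))
    out = out.sort_values(["Matricule", "start"])

    # Gap to the next row within the same Matricule (NaT on the last row).
    gap = out.groupby("Matricule")["start"].shift(-1) - out["end"]
    mask = (gap > pd.Timedelta(min_gap)) & (gap < pd.Timedelta(max_gap))

    # Template rows: start where the original row ended, last `fill`.
    new_rows = out[mask].copy()
    new_rows["start"] = new_rows["end"]
    new_rows["end"] = new_rows["start"] + pd.Timedelta(fill)
    new_rows["Startdate"] = new_rows["start"].dt.strftime("%Y-%m-%d")
    new_rows["Starthour"] = new_rows["start"].dt.strftime("%H:%M:%S")
    new_rows["Enddate"] = new_rows["end"].dt.strftime("%Y-%m-%d")
    new_rows["Endhour"] = new_rows["end"].dt.strftime("%H:%M:%S")

    final = pd.concat([out, new_rows]).sort_values(["Matricule", "start"])
    return final.drop(columns=["start", "end"]).reset_index(drop=True)


# Sample data copied from the question (Matricule kept as strings).
df = pd.DataFrame({
    "Matricule": ["5357", "5357", "5357", "5357", "aaaa", "aaaa"],
    "Startdate": ["2019-01-08"] * 3 + ["2019-03-08"] + ["2019-01-08"] * 2,
    "Starthour": ["14:21:06", "15:29:23", "19:51:11", "20:05:49", "14:17:51", "18:21:16"],
    "Enddate": ["2019-01-08"] * 3 + ["2019-03-08"] + ["2019-01-08"] * 2,
    "Endhour": ["14:34:42", "15:33:43", "20:02:48", "21:04:52", "14:32:10", "18:39:26"],
})

final = insert_gap_rows(df)
print(final)
```

Exposing the thresholds as parameters makes it easy to drop or change the 24-hour upper bound, which the answer added only because the sample data seemed to call for it.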

【Comments】: