Pandas - binning number of minutes in various datetime ranges (answers)

【Question title】: Pandas- binning number of minutes in various datetime ranges
【Posted】: 2019-07-06 15:02:53
【Question description】:

I am looking for an efficient way to process the following data in pandas.

I have a dataframe containing a few hundred thousand start and end timestamps:

data_df
                      start_ts                     end_ts
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00
2    2019-06-11 14:00:00+00:00  2019-06-11 19:00:00+00:00
3    2019-06-14 12:00:00+00:00  2019-06-14 18:30:00+00:00
4    2019-06-10 12:00:00+00:00  2019-06-10 21:30:00+00:00
5    2019-06-11 12:00:00+00:00  2019-06-11 18:30:00+00:00
...

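I also have a set of labeled time bins (tp1-tp10). There are 10 bins per day, but the times of those bins can change from day to day (e.g. tp1 might run from 00:00 to 01:30 on one day, but from 00:00 to 01:45 on another day). Each data set to be processed covers 7 days with 10 time periods per day, so the set of ranges has size 70 and looks like this: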

labeled_bins_df
                   start_range                  end_range  label
0    2019-06-10 00:00:00+00:00  2019-06-10 04:30:00+00:00    tp1
1    2019-06-10 04:30:00+00:00  2019-06-10 09:45:00+00:00    tp2
2    2019-06-10 09:45:00+00:00  2019-06-10 12:30:00+00:00    tp3
...

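I would like a table with the original data_df data plus additional columns, tp1 through tp10, holding the number of minutes each row spends in that bin: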

timed_bins
                      start_ts                     end_ts    tp1    tp2    tp3    tp4 ...
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00      0      0     30    120 ...
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00      0     45     45      0 ...

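I am currently doing this naively: looping over my rows and searching for the bins each data row falls into, which, as you can imagine, is very slow. Is there any pandas-fu that can perform this kind of binning on datetime ranges?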

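EDIT: One thought that might help to think in a new direction. If I converted all of my timestamps (both in my data and in my labeled bins) to unix timestamps (seconds since January 1, 1970), this would become a binning/summing problem over integer ranges rather than dates. That would yield the number of seconds in each bin; dividing by 60 gives the number of minutes in each bin. This removes all concerns about date boundaries and the like.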
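A minimal sketch of that integer-range idea, assuming NumPy broadcasting is acceptable: the overlap of an interval [s, e) with a bin [bs, be) is max(0, min(e, be) - max(s, bs)). The helper `to_epoch_seconds` and the frame names `rows` and `bins` are hypothetical, and the inputs are just two rows and four bins taken from the sample data in this question.

```python
import numpy as np
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01", tz="UTC")

def to_epoch_seconds(col):
    """Convert a tz-aware datetime column to integer seconds since the unix epoch."""
    return ((col - EPOCH) // pd.Timedelta(seconds=1)).to_numpy()

# Two sample rows and four bins (hypothetical miniature of the real inputs).
rows = pd.DataFrame({
    "start_ts": pd.to_datetime(["2019-06-10 12:00", "2019-06-10 22:00"], utc=True),
    "end_ts": pd.to_datetime(["2019-06-10 22:30", "2019-06-11 05:30"], utc=True),
})
bins = pd.DataFrame({
    "start_ts": pd.to_datetime(["2019-06-10 00:00", "2019-06-10 08:15",
                                "2019-06-10 18:00", "2019-06-11 00:00"], utc=True),
    "end_ts": pd.to_datetime(["2019-06-10 08:15", "2019-06-10 18:00",
                              "2019-06-11 00:00", "2019-06-11 09:00"], utc=True),
    "label": ["t1", "t2", "t3", "t1"],
})

# Column vectors for the rows, row vectors for the bins, so that every
# arithmetic operation broadcasts to shape (n_rows, n_bins).
s = to_epoch_seconds(rows["start_ts"])[:, None]
e = to_epoch_seconds(rows["end_ts"])[:, None]
bs = to_epoch_seconds(bins["start_ts"])[None, :]
be = to_epoch_seconds(bins["end_ts"])[None, :]

# Overlap in seconds; negative values mean "no overlap" and are clipped to 0.
overlap_sec = np.clip(np.minimum(e, be) - np.maximum(s, bs), 0, None)

# One column per bin, then sum the columns that share a label (t1 of day 1
# and t1 of day 2 end up in the same column).
per_bin = pd.DataFrame(overlap_sec / 60, index=rows.index,
                       columns=bins["label"].to_numpy())
minutes = per_bin.T.groupby(level=0).sum().T
print(minutes)
```

The intermediate arrays have one cell per (row, bin) pair, so with a few hundred thousand rows and 70 bins it may be worth processing the rows in chunks.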

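EDIT 2: As requested, here is a simplified set of sample data using three different time bins. I deliberately made one of the data samples (the second row) span 2 days. There is also a result_df showing the expected output.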

import datetime

import pandas as pd

data_samples = [
    {'start_ts': '2019-06-10T12:00:00+0000', 'end_ts': '2019-06-10T22:30:00+0000'},
    {'start_ts': '2019-06-10T22:00:00+0000', 'end_ts': '2019-06-11T05:30:00+0000'},
    {'start_ts': '2019-06-10T10:00:00+0000', 'end_ts': '2019-06-10T14:15:00+0000'},
    {'start_ts': '2019-06-12T08:07:00+0000', 'end_ts': '2019-06-12T18:22:00+0000'},
    {'start_ts': '2019-06-11T14:03:00+0000', 'end_ts': '2019-06-11T15:30:00+0000'},
    {'start_ts': '2019-06-11T02:33:00+0000', 'end_ts': '2019-06-11T10:31:00+0000'}
]

data_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z')} for x in data_samples]

data_df = pd.DataFrame(data_set)[['start_ts', 'end_ts']]

time_bin_samples = [
    {'start_ts': '2019-06-10T00:00:00+0000', 'end_ts': '2019-06-10T08:15:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-10T08:15:00+0000', 'end_ts': '2019-06-10T18:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-10T18:00:00+0000', 'end_ts': '2019-06-11T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-11T00:00:00+0000', 'end_ts': '2019-06-11T09:00:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-11T09:00:00+0000', 'end_ts': '2019-06-11T19:15:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-11T19:15:00+0000', 'end_ts': '2019-06-12T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-12T00:00:00+0000', 'end_ts': '2019-06-12T10:30:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-12T10:30:00+0000', 'end_ts': '2019-06-12T12:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-12T12:00:00+0000', 'end_ts': '2019-06-13T00:00:00+0000', 'label': 't3'},
]

time_bin_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'label': x['label']} for x in time_bin_samples
]

time_bin_df = pd.DataFrame(time_bin_set)[['start_ts', 'end_ts', 'label']]

result_set = [
    {'t1': 0, 't2': 360, 't3': 270},
    {'t1': 330, 't2': 0, 't3': 120},
    {'t1': 0, 't2': 255, 't3': 0},
    {'t1': 143, 't2': 90, 't3': 382},
    {'t1': 0, 't2': 87, 't3': 0},
    {'t1': 387, 't2': 91, 't3': 0}
]

result_df = pd.DataFrame(result_set)

【Question discussion】:

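Can start_ts and end_ts span multiple dates?
Yes, they can run from one day into the next.
That does complicate things: because the tp_i change every day, you don't know in advance whether you need the first day's tp1 or the second day's tp1. The question now is whether any of these spans is large enough to overlap the same tp_i span on different days (with different limits). What happens then?
I doubt a single row would cover both tp1 of day 1 and tp1 of day 2 (though I suppose it is not impossible). That said, I don't care how much falls on each day; I just want the total time spent in tp1. So if there is a very long sample that includes 30 minutes of tp1 from day 1 and 15 minutes of tp1 from day 2, I need a total of 45 minutes of tp1 time.
To clarify: by 'total time in tp1' I mean tp_i for all i (not just tp_1).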

Tags: python pandas python-datetime binning

【Solution 1】:

I know that iterating over the rows of a dataframe is not efficient.

Here I would use merge_asof to identify the first and last bin for each row of data_df.
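To illustrate what merge_asof does in this step, here is a small sketch on a hypothetical miniature of the data (the column names `bin_start` and `bin_end` are chosen for the example only). A forward match of the row start against the bin ends finds the first bin a row touches; a backward match of the row end against the bin starts finds the last one.

```python
import pandas as pd

# Three bins of one day and two intervals, taken from the question's sample data.
bins = pd.DataFrame({
    "bin_start": pd.to_datetime(["2019-06-10 00:00", "2019-06-10 08:15",
                                 "2019-06-10 18:00"], utc=True),
    "bin_end": pd.to_datetime(["2019-06-10 08:15", "2019-06-10 18:00",
                               "2019-06-11 00:00"], utc=True),
    "label": ["t1", "t2", "t3"],
})
rows = pd.DataFrame({
    "start_ts": pd.to_datetime(["2019-06-10 10:00", "2019-06-10 12:00"], utc=True),
    "end_ts": pd.to_datetime(["2019-06-10 14:15", "2019-06-10 22:30"], utc=True),
})

# First bin: the nearest bin whose end is at or after the row start.
# merge_asof requires both inputs to be sorted on their merge keys.
first = pd.merge_asof(rows.sort_values("start_ts"), bins.sort_values("bin_end"),
                      left_on="start_ts", right_on="bin_end",
                      direction="forward")

# Last bin: the nearest bin whose start is at or before the row end.
last = pd.merge_asof(rows.sort_values("end_ts"), bins.sort_values("bin_start"),
                     left_on="end_ts", right_on="bin_start",
                     direction="backward")

print(first[["start_ts", "label"]])
print(last[["end_ts", "label"]])
```

Here both intervals start in t2; the first one also ends in t2, while the second one ends in t3.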

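Then I would build a list of sub-dataframes by iterating once over the dataframe values, adding all the bins that correspond to a row, and concatenate that list.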

From there it is enough to compute the time interval for each bin and use pivot_table to get the expected result.
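As a small illustration of this final step, the sketch below starts from a hypothetical long-format frame `long_df` (one line per row/bin intersection, already clipped to the row's start and end) and pivots it into one column per label. The values are hand-written from the first two rows of the question's sample data.

```python
import pandas as pd

# One line per (data row, bin) intersection; ix_data is the original row index.
long_df = pd.DataFrame({
    "ix_data": [0, 0, 1, 1],
    "label": ["t2", "t3", "t3", "t1"],
    "start_ts": pd.to_datetime(["2019-06-10 12:00", "2019-06-10 18:00",
                                "2019-06-10 22:00", "2019-06-11 00:00"], utc=True),
    "end_ts": pd.to_datetime(["2019-06-10 18:00", "2019-06-10 22:30",
                              "2019-06-11 00:00", "2019-06-11 05:30"], utc=True),
})

# Duration of each intersection in minutes.
long_df["duration"] = (long_df["end_ts"] - long_df["start_ts"]).dt.total_seconds() / 60

# Sum the durations per row and label; labels a row never touches become 0.
wide = long_df.pivot_table(values="duration", index="ix_data", columns="label",
                           aggfunc="sum", fill_value=0)
print(wide)
```

Because the aggregation is a sum per label, time spent in the same label on different days is added together, which is what the question asks for.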

The code could be:

# store the index as a column to make sure to keep it
data_df = data_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])
time_bin_df = time_bin_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])

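# identify first and last bin per row
# (merge_asof needs both frames sorted on their merge keys; sorting data_df by
# end_ts keeps start_ts ordered for the sample data, but not in general when
# one interval is nested inside another)
first = pd.merge_asof(data_df, time_bin_df, left_on='start_ts',
                      right_on='end_ts', suffixes=('', '_first'),
                      direction='forward').values
last = pd.merge_asof(data_df, time_bin_df, left_on='end_ts', right_on='start_ts',
                     suffixes=('', '_last')).values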

# build a list of bin dataframes (one per row in data_df)
data = []
for i, val in enumerate(first):
    elt = time_bin_df[(time_bin_df['ix']>=val[3])
                      &(time_bin_df['ix']<=last[i][3])].copy()
    # compute the begin and end of the intersection of the period and the bin
    elt.loc[elt['start_ts']<val[1], 'start_ts'] = val[1]
    elt.loc[elt['end_ts']>val[2], 'end_ts'] = val[2]
    elt['ix_data'] = val[0]
    data.append(elt)

# concat everything
tmp = pd.concat(data)

# compute durations in minutes
tmp['duration'] = (tmp['end_ts'] - tmp['start_ts']).dt.total_seconds() / 60

# pivot to get the expected result
result_df = tmp.pivot_table('duration', 'ix_data', 'label', 'sum', fill_value=0
                            ).rename_axis(None).rename_axis(None, axis=1)

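This may still take some time, because building the list of dataframes remains a lengthy operation, but the other operations should be vectorized.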
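If the remaining loop turns out to be the bottleneck, one possible alternative (not the approach above, and only a sketch) is to pair every row with every bin through a cross join and clip the intersections in one vectorized pass. It assumes pandas 1.2 or later for `how="cross"`; the bin columns are renamed to `bin_start` and `bin_end` for the example, and the inputs are the sample data from the question.

```python
import pandas as pd

# Sample intervals from the question (all times UTC).
data_df = pd.DataFrame({
    "start_ts": pd.to_datetime(["2019-06-10 12:00", "2019-06-10 22:00",
                                "2019-06-10 10:00", "2019-06-12 08:07",
                                "2019-06-11 14:03", "2019-06-11 02:33"], utc=True),
    "end_ts": pd.to_datetime(["2019-06-10 22:30", "2019-06-11 05:30",
                              "2019-06-10 14:15", "2019-06-12 18:22",
                              "2019-06-11 15:30", "2019-06-11 10:31"], utc=True),
})

# The sample bins are contiguous, so they can be written as a list of edges.
edges = pd.to_datetime(["2019-06-10 00:00", "2019-06-10 08:15", "2019-06-10 18:00",
                        "2019-06-11 00:00", "2019-06-11 09:00", "2019-06-11 19:15",
                        "2019-06-12 00:00", "2019-06-12 10:30", "2019-06-12 12:00",
                        "2019-06-13 00:00"], utc=True)
time_bin_df = pd.DataFrame({"bin_start": edges[:-1], "bin_end": edges[1:],
                            "label": ["t1", "t2", "t3"] * 3})

# Every data row paired with every bin (n_rows * n_bins lines).
pairs = (data_df.rename_axis("ix_data").reset_index()
         .merge(time_bin_df, how="cross"))

# Intersection of [start_ts, end_ts) with [bin_start, bin_end).
lo = pairs["bin_start"].where(pairs["bin_start"] > pairs["start_ts"], pairs["start_ts"])
hi = pairs["bin_end"].where(pairs["bin_end"] < pairs["end_ts"], pairs["end_ts"])

# Negative durations mean the row and the bin do not overlap.
pairs["minutes"] = ((hi - lo).dt.total_seconds() / 60).clip(lower=0)

result = pairs.pivot_table(values="minutes", index="ix_data", columns="label",
                           aggfunc="sum", fill_value=0)
print(result)
```

The cross join materializes one line per (row, bin) pair, so for a few hundred thousand rows and 70 bins it is probably safer to process data_df in chunks and concatenate the pivoted results.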

【Discussion】: