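【Posted】: 2021-06-09 22:28:27
【Problem description】:
Hi everyone, I need help finding a solution. For each ID, I need to generate one row for every date between its creation date and its cancellation date, as shown in the image below.
I have the following code, which "works" (but only on small datasets):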
import itertools
import numpy as np
import pandas as pd

# `date_range` is slow so we only call it once
all_dates = pd.date_range(test['Date'].min(), test['Cancel_date'].max())
# For each day in the range, number them as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1
# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = test['Date'].map(rank).values
end = test['Cancel_date'].map(rank).values
start = start.astype(int)
end = end.astype(int)
# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))
# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)
# Align the rest of the columns to the expanded dates
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(test['internal_id'], duration)
start_date = np.repeat(test['Date'], duration)
end_date = np.repeat(test['Cancel_date'], duration)
# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'internal_id': ids,
    'Date': dates
})
The problem is that with a DataFrame of 16k records, `indices` grows too large and raises a memory error:
# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))
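One direction that might reduce the memory pressure is to build the per-day offsets with NumPy arrays instead of a Python list of integers, since each element of a Python list costs considerably more than the 8 bytes of an `int64`. The sketch below is only an illustration under some assumptions: `expand_date_ranges` is a hypothetical helper name, the columns `Date`, `Cancel_date` and `internal_id` are timezone-naive datetimes (and an ID column) with no missing values, and `Cancel_date` is never earlier than `Date`. It also skips `all_dates` and `rank` entirely and adds day offsets to each start date directly.

```python
import numpy as np
import pandas as pd


def expand_date_ranges(df):
    """Hypothetical helper: one output row per day in [Date, Cancel_date]."""
    # Work at day resolution so that differences are whole days
    start = df["Date"].to_numpy().astype("datetime64[D]")
    end = df["Cancel_date"].to_numpy().astype("datetime64[D]")

    # Days covered by each row, counting both endpoints
    duration = (end - start).astype(np.int64) + 1

    # Position in the expanded output where each original row begins
    first_pos = np.cumsum(duration) - duration

    # Offset of every output row within its own range: 0, 1, ..., duration - 1
    offsets = np.arange(duration.sum()) - np.repeat(first_pos, duration)

    # Index of the original row that each output row comes from
    row = np.repeat(np.arange(len(df)), duration)

    # Start date plus offset gives the calendar day for each output row
    dates = (start[row] + offsets.astype("timedelta64[D]")).astype("datetime64[ns]")

    return pd.DataFrame({
        "start_date": df["Date"].to_numpy()[row],
        "end_date": df["Cancel_date"].to_numpy()[row],
        "internal_id": df["internal_id"].to_numpy()[row],
        "Date": dates,
    })


# Toy data standing in for the real `test` DataFrame (made up for illustration)
test = pd.DataFrame({
    "internal_id": [1, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Cancel_date": pd.to_datetime(["2020-01-03", "2020-01-02"]),
})
result = expand_date_ranges(test)
```

Note that the result still holds one row per ID per day, so this only removes the intermediate list overhead. If a few rows have a cancel date very far in the future, the output itself may not fit in memory, and it may be worth checking `(test['Cancel_date'] - test['Date']).dt.days.describe()` before expanding.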
【Comments】:
Tags: python pandas numpy loops range