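Multiplying records based on the date difference between two columns: answers

【Question title】: Multiplying records based on the difference of dates between 2 columns
【Posted】: 2021-06-09 22:28:27
【Question】:

Hi everyone, I need help finding a solution. For each ID, I need to generate one row for every date between its creation date and its cancellation date. (The original post illustrated this with an image.)

I have the following code, which "works" (but only on small datasets):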


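import itertools

import numpy as np
import pandas as pd

# `test` is the asker's DataFrame with `internal_id`, `Date` and `Cancel_date` columns
# `date_range` is slow so we only call it once
all_dates = pd.date_range(test['Date'].min(), test['Cancel_date'].max())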

# For each day in the range, number them as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1

# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = test['Date'].map(rank).values
end = test['Cancel_date'].map(rank).values

start = start.astype(int)
end = end.astype(int)

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))

# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)

# Align the rest of the columns to the expanded dates
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(test['internal_id'], duration)
start_date = np.repeat(test['Date'], duration)
end_date = np.repeat(test['Cancel_date'], duration)

# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'internal_id': ids,
    'Date': dates
})

The problem is that with a 16k-record DataFrame, `indices` becomes too large and I get a memory error at this step:

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))

【Question comments】:

Tags: python pandas numpy loops range

【Solution 1】:

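Convert the columns to datetime by appending the year the data belongs to, so that the columns' own format (%d-%b) becomes "%d-%b-%Y". Use apply with axis=1 to call date_range on each row. Passing closed="left" to date_range makes the interval closed on the left only: the start date is included and the cancel date is not. This avoids creating an extra row in which both dates have the same value. Use explode to turn each element of a date_range into its own row. Note that in pandas 1.4 and later this parameter is called inclusive, and closed was removed in pandas 2.0.

Because the interval is half-open, rows with the same creation and cancel date produce an empty range, which becomes a missing value after explode, so we use fillna to fill it with the cancel date. After that, use dt.strftime to convert the dates back to the desired format.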

The file sample.csv is used as input:

id  Date    CancelDate daysDiff
aaaaa   01-mar  01-mar  0
bbbb    01-mar  05-mar  4
cccc    03-mar  06-mar  3

import pandas as pd

df = pd.read_csv("sample.csv", sep=r'\s+')  # raw string: columns are whitespace-separated

df["Date"] = pd.to_datetime(df.Date + "-2020", format="%d-%b-%Y")
df["CancelDate"] = pd.to_datetime(df.CancelDate + "-2020", format="%d-%b-%Y")
df = df.drop(columns=["daysDiff"])

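# One date_range per row, closed on the left only (CancelDate itself is excluded).
# On pandas >= 1.4 use inclusive="left" instead; closed= was removed in pandas 2.0.
df["Date"] = df.apply(
    lambda x: pd.date_range(start=x.Date, end=x.CancelDate, closed="left"),
    axis=1,
)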

dout = df.explode("Date").reset_index(drop=True)
dout["Date"] = dout.Date.fillna(dout.CancelDate)

dout["CancelDate"] = dout.CancelDate.dt.strftime("%d-%b")
dout["Date"] = dout.Date.dt.strftime("%d-%b")

print(dout)

Output of dout:

      id    Date CancelDate
0  aaaaa  01-Mar     01-Mar
1   bbbb  01-Mar     05-Mar
2   bbbb  02-Mar     05-Mar
3   bbbb  03-Mar     05-Mar
4   bbbb  04-Mar     05-Mar
5   cccc  03-Mar     06-Mar
6   cccc  04-Mar     06-Mar
7   cccc  05-Mar     06-Mar
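For larger inputs such as the 16k-row frame in the question, the per-row apply can be replaced by array operations. The following is only a sketch of that idea, not the answerer's method: it rebuilds the sample data inline, repeats each row once per day, and adds a per-row day offset computed with NumPy. The names `n_days`, `starts` and `offsets` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the answer, with the year 2020 assumed as above
df = pd.DataFrame({
    "id": ["aaaaa", "bbbb", "cccc"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-03"]),
    "CancelDate": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-06"]),
})

# Rows to emit per record: days in [Date, CancelDate), but at least 1
# so that same-day records are kept (the role fillna plays in the answer)
n_days = (df["CancelDate"] - df["Date"]).dt.days.clip(lower=1).to_numpy()

# Repeat every record n_days times
out = df.loc[df.index.repeat(n_days)].reset_index(drop=True)

# Position of each output row within its record: 0, 1, 2, ...
starts = np.cumsum(n_days) - n_days
offsets = np.arange(n_days.sum()) - np.repeat(starts, n_days)

# Shift the start date by that many days
out["Date"] = out["Date"] + pd.to_timedelta(offsets, unit="D")

print(out)
```

This keeps everything in fixed-width arrays rather than a Python list of integers, which should use far less memory per element, though the result still has one row per record per day.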

【Comments】: