【Question title】:Multiplying records based on the date difference between two columns
【Posted】:2021-06-09 22:28:27
【Question】:

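Hi everyone, I need help finding a solution. For each ID, I need one row for every date between its creation date and its cancel date.

I have the following code, which "works" (but only on small datasets):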


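import itertools

import numpy as np
import pandas as pd

# `test` is the input DataFrame with datetime columns `Date` and `Cancel_date`
# `date_range` is slow so we only call it once
all_dates = pd.date_range(test['Date'].min(), test['Cancel_date'].max())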

# For each day in the range, number them as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1

# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = test['Date'].map(rank).values
end = test['Cancel_date'].map(rank).values

start = start.astype(int)
end = end.astype(int)

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))

# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)

# Align the rest of the columns to the expanded dates
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(test['internal_id'], duration)
start_date = np.repeat(test['Date'], duration)
end_date = np.repeat(test['Cancel_date'], duration)

# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'internal_id': ids,
    'Date': dates
})

The problem is that with a DataFrame of 16k records, `indices` becomes too large and raises a memory error at this step:

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ... 509]
indices = list(itertools.chain.from_iterable([range(s,e+1) for s,e in zip(start, end)]))

【Comments】:

    Tags: python pandas numpy loops range


    【Solution 1】:

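    Convert the columns to datetime using their specific format (%d-%b) plus a year for the data ("%d-%b-%Y"). Use apply with axis=1 to iterate over each row and call date_range. Passing closed="left" to date_range asks for an interval that is closed only on the left, so the cancel date itself is excluded; this avoids creating an extra row whose date equals the cancel date. Use explode to turn each element of the date_range into its own row.

    Because the interval is half-open, rows with the same creation and cancel date produce an empty range, which explode turns into a missing value, so we fill those with fillna. After that, dt.strftime converts the dates back to the desired format.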
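
    The empty-range case can be seen in isolation with a minimal sketch. The two-row frame below is made up for illustration, and the half-open range is emulated by subtracting one day from the end date rather than by passing closed="left", which is an assumption made so the snippet does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical two-row frame: the first row has equal start and cancel dates
df = pd.DataFrame({
    "id": ["aaaaa", "bbbb"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-01"]),
    "CancelDate": pd.to_datetime(["2020-03-01", "2020-03-03"]),
})

# Half-open range [Date, CancelDate), emulated by ending one day earlier;
# when Date == CancelDate the end precedes the start and the range is empty
one_day = pd.Timedelta(days=1)
df["Date"] = pd.Series(
    [list(pd.date_range(start=s, end=e - one_day))
     for s, e in zip(df["Date"], df["CancelDate"])],
    index=df.index,
)

# explode emits a single missing value for an empty list
exploded = df.explode("Date").reset_index(drop=True)
print(exploded["Date"].isna().tolist())

# Fill the gap with the cancel date, then restore a datetime dtype
filled = pd.to_datetime(exploded["Date"].fillna(exploded["CancelDate"]))
print(filled.dt.strftime("%Y-%m-%d").tolist())
```

    Without the fill step, the row for "aaaaa" would keep a missing date instead of its single day.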

    The file sample.csv is used as input:

    id  Date    CancelDate daysDiff
    aaaaa   01-mar  01-mar  0
    bbbb    01-mar  05-mar  4
    cccc    03-mar  06-mar  3
    
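    import pandas as pd
    
    # The sample file is whitespace-separated; the raw string avoids an
    # invalid-escape warning for the regex separator
    df = pd.read_csv("sample.csv", sep=r"\s+")
    
    # The file has no year, so append one before parsing
    df["Date"] = pd.to_datetime(df.Date + "-2020", format="%d-%b-%Y")
    df["CancelDate"] = pd.to_datetime(df.CancelDate + "-2020", format="%d-%b-%Y")
    df = df.drop(columns=["daysDiff"])
    
    # One half-open date range per row: [Date, CancelDate).
    # Note: pandas 1.4 added inclusive="left" and deprecated closed=,
    # which was removed in pandas 2.0
    df["Date"] = df.apply(lambda x:
        pd.date_range(start=x.Date, end=x.CancelDate, closed="left")
    , axis=1)
    
    # One output row per date; an empty range explodes to a missing value
    dout = df.explode("Date").reset_index(drop=True)
    # explode leaves an object column, so convert back to datetime for .dt
    dout["Date"] = pd.to_datetime(dout.Date.fillna(dout.CancelDate))
    
    # Convert both columns back to the original day-month format
    dout["CancelDate"] = dout.CancelDate.dt.strftime("%d-%b")
    dout["Date"] = dout.Date.dt.strftime("%d-%b")
    
    print(dout)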
    

    Output of dout:

          id    Date CancelDate
    0  aaaaa  01-Mar     01-Mar
    1   bbbb  01-Mar     05-Mar
    2   bbbb  02-Mar     05-Mar
    3   bbbb  03-Mar     05-Mar
    4   bbbb  04-Mar     05-Mar
    5   cccc  03-Mar     06-Mar
    6   cccc  04-Mar     06-Mar
    7   cccc  05-Mar     06-Mar
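
    For larger frames, such as the 16k-record case in the question, the per-row apply and the Python list of indices can both be avoided. The following is a sketch of a different technique, not the method above: it expands the ranges with NumPy arrays only (np.repeat plus a cumulative-sum offset). The function name expand_date_ranges is hypothetical, the column names are taken from the question, and the sketch assumes both date columns are datetime64 with no missing values and that every cancel date is on or after its start date. The end date is included, as in the question's code. The result still holds one row per ID per day, so it reduces overhead but cannot help if the expanded table itself does not fit in memory.

```python
import numpy as np
import pandas as pd

def expand_date_ranges(df, id_col="internal_id",
                       start_col="Date", end_col="Cancel_date"):
    """Return one row per id per day, end date included (hypothetical helper)."""
    # Work at day resolution so differences are whole days
    start = df[start_col].to_numpy().astype("datetime64[D]")
    end = df[end_col].to_numpy().astype("datetime64[D]")

    # Number of days each input row expands to (assumes end >= start)
    duration = (end - start).astype(np.int64) + 1

    # Position of each output row inside its own range: 0, 1, ..., duration-1
    first = np.cumsum(duration) - duration
    offsets = np.arange(duration.sum(), dtype=np.int64) - np.repeat(first, duration)

    # Start date of the range plus the day offset
    dates = np.repeat(start, duration) + offsets.astype("timedelta64[D]")

    return pd.DataFrame({
        "start_date": np.repeat(df[start_col].to_numpy(), duration),
        "end_date": np.repeat(df[end_col].to_numpy(), duration),
        "internal_id": np.repeat(df[id_col].to_numpy(), duration),
        "Date": dates.astype("datetime64[ns]"),
    })

# Small made-up frame using the question's column names
test = pd.DataFrame({
    "internal_id": ["a", "b"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-03"]),
    "Cancel_date": pd.to_datetime(["2020-03-01", "2020-03-05"]),
})
result = expand_date_ranges(test)
print(result)
```

    The offsets are plain int64 arrays, which take 8 bytes per element, whereas a Python list of integers stores a pointer plus a separate integer object for each element.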
    

    【Comments】:
