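【Question title】: Find date range overlap in python
【Posted】: 2024-05-03 02:45:02
【Question description】:

I am trying to find a more efficient way to detect overlapping date ranges (start/end dates given on each row) in a dataframe, grouped by a specific column (id). The dataframe is sorted on the 'from' column. I think there is a way to avoid the double apply that I am using: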

import pandas as pd
from datetime import datetime

df = pd.DataFrame(columns=['id','from','to'], index=range(5), \
                  data=[[878,'2006-01-01','2007-10-01'],
                        [878,'2007-10-02','2008-12-01'],
                        [878,'2008-12-02','2010-04-03'],
                        [879,'2010-04-04','2199-05-11'],
                        [879,'2016-05-12','2199-12-31']])

df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])


    id  from        to
0   878 2006-01-01  2007-10-01
1   878 2007-10-02  2008-12-01
2   878 2008-12-02  2010-04-03
3   879 2010-04-04  2199-05-11
4   879 2016-05-12  2199-12-31

I use the 'apply' function to loop over all the groups, and within each group I use 'apply' again for each row:

def check_date_by_id(df):
    
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()
    
    def check_date_by_row(x):
        
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x
    
    return df.apply(check_date_by_row, axis=1).drop(['prevFrom','prevTo'], axis=1)

df.groupby('id').apply(check_date_by_id)

    id  from        to          overlap
0   878 2006-01-01  2007-10-01  False
1   878 2007-10-02  2008-12-01  False
2   878 2008-12-02  2010-04-03  False
3   879 2010-04-04  2199-05-11  False
4   879 2016-05-12  2199-12-31  True

My code was inspired by the following links:

【Discussion】:

    Tags: python pandas time-series


    【Solution 1】:

    You can shift the to column and subtract the datetimes directly.

    df['overlap'] = (df['to'].shift()-df['from']) > pd.Timedelta(0)
    

    Applying it while grouping by id might look like

    df['overlap'] = (df.groupby('id')
                       .apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(0))
                       .reset_index(level=0, drop=True))
    

    Demo

    >>> df
        id       from         to
    0  878 2006-01-01 2007-10-01
    1  878 2007-10-02 2008-12-01
    2  878 2008-12-02 2010-04-03
    3  879 2010-04-04 2199-05-11
    4  879 2016-05-12 2199-12-31
    
    >>> df['overlap'] = (df.groupby('id')
                           .apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(0))
                           .reset_index(level=0, drop=True))
    
    >>> df
        id       from         to overlap
    0  878 2006-01-01 2007-10-01   False
    1  878 2007-10-02 2008-12-01   False
    2  878 2008-12-02 2010-04-03   False
    3  879 2010-04-04 2199-05-11   False
    4  879 2016-05-12 2199-12-31    True
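
For what it's worth, the same per-id check can probably be written without `apply` at all by shifting inside the groupby. This is only a minimal sketch that rebuilds the question's data; `prev_to` is a name introduced here for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [878, 878, 878, 879, 879],
    "from": pd.to_datetime(["2006-01-01", "2007-10-02", "2008-12-02",
                            "2010-04-04", "2016-05-12"]),
    "to": pd.to_datetime(["2007-10-01", "2008-12-01", "2010-04-03",
                          "2199-05-11", "2199-12-31"]),
})

# Previous row's end date within the same id (NaT for the first row of each id)
prev_to = df.groupby("id")["to"].shift()

# Comparisons against NaT evaluate to False, so first rows are never flagged
df["overlap"] = prev_to > df["from"]

print(df["overlap"].tolist())  # [False, False, False, False, True]
```

Because the shift happens per group, the first row of id 879 is not compared against the last row of id 878.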
    

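    【Discussion】:

    • Thanks. Simple and clear. Do you know how to do the same thing (groupby + check), but across all the dates rather than just consecutive ones?
    • I'm not entirely sure what you mean... if the dates are sorted, what else is there to do? I added an example that groups by id for you.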
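
One possible reading of the follow-up question is that a row may overlap an earlier row that is not its immediate predecessor, for example when an early range is very long. Under that assumption, a sketch could compare each 'from' with the running maximum of all earlier 'to' values in the same id. The sample data and the column name `overlap_any_previous` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the first range of id 1 spans the whole year
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "from": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-06-01", "2020-03-01"]),
    "to": pd.to_datetime(["2020-12-31", "2020-03-01", "2020-07-01", "2020-04-01"]),
})
df = df.sort_values(["id", "from"]).reset_index(drop=True)

# Consecutive-only check: compares with the immediately preceding row
consecutive = df.groupby("id")["to"].shift() > df["from"]

# Latest end date among ALL earlier rows of the same id
prev_max_to = df.groupby("id")["to"].transform(lambda s: s.shift().cummax())
df["overlap_any_previous"] = prev_max_to > df["from"]

print(consecutive.tolist())                 # [False, True, False, False]
print(df["overlap_any_previous"].tolist())  # [False, True, True, False]
```

The third row starts after its predecessor ends, so the consecutive check misses it, but it still falls inside the first row's range.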
    【Solution 2】:

    Another solution. This could be rewritten to take advantage of Interval.overlaps in pandas 0.24 and later.

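    def overlapping_groups(group):
        # Returns the group's id if any row starts or ends inside another row's
        # closed interval; returns None (implicitly) when nothing overlaps.
        # Note: this answer uses 'start_date'/'end_date' rather than the question's 'from'/'to'.
        if len(group) > 1:
            for index, row in group.iterrows():
                for index2, row2 in group.drop(index).iterrows():
                    int1 = pd.Interval(row2['start_date'], row2['end_date'], closed='both')
                    if row['start_date'] in int1:
                        return row['id']
                    if row['end_date'] in int1:
                        return row['id']

    gcols = ['id']
    group_output = df.groupby(gcols, group_keys=False).apply(overlapping_groups)
    # Keep only the ids for which an overlap was found
    ids_with_overlap = set(group_output[~group_output.isnull()].reset_index(drop=True))
    df[df['id'].isin(ids_with_overlap)]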
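
As a rough idea of what the `overlaps`-based rewrite mentioned above might look like, here is a sketch built on `pd.IntervalIndex.overlaps`. The helper name `ids_with_overlap` and its parameters are hypothetical, and it uses the question's 'from'/'to' columns rather than 'start_date'/'end_date'.

```python
import pandas as pd

def ids_with_overlap(df, key="id", start="from", end="to"):
    """Return the set of ids that contain at least one pair of overlapping ranges."""
    found = set()
    for gid, group in df.groupby(key):
        intervals = pd.IntervalIndex.from_arrays(group[start], group[end], closed="both")
        for interval in intervals:
            # Every interval overlaps itself, so more than one hit means a real overlap
            if intervals.overlaps(interval).sum() > 1:
                found.add(gid)
                break
    return found

# Sample data from the question
df = pd.DataFrame({
    "id": [878, 878, 878, 879, 879],
    "from": pd.to_datetime(["2006-01-01", "2007-10-02", "2008-12-02",
                            "2010-04-04", "2016-05-12"]),
    "to": pd.to_datetime(["2007-10-01", "2008-12-01", "2010-04-03",
                          "2199-05-11", "2199-12-31"]),
})

overlapping = ids_with_overlap(df)
print(df[df["id"].isin(overlapping)])
```

This is still quadratic within each group, so it is probably only attractive when groups are small or the ranges are not sorted.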
    

    【Discussion】:

      【Solution 3】:

      You can compare the 'from' time with the previous 'to' time:

      df['to'].shift() > df['from']
      

      Output:

      0    False
      1    False
      2    False
      3    False
      4     True
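
One detail that may matter: the code in the question adds one day before testing, so it treats a range that starts on the very day the previous one ends as overlapping, while a strict `>` does not. A small sketch with made-up dates shows the difference; whether touching ranges should count is an assumption to settle for your own data.

```python
import pandas as pd

# Hypothetical data: the second range starts on the day the first one ends
df = pd.DataFrame({
    "from": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-02-01"]),
    "to": pd.to_datetime(["2020-01-10", "2020-01-20", "2020-02-10"]),
})

prev_to = df["to"].shift()

strict = prev_to > df["from"]      # touching endpoints do not count
inclusive = prev_to >= df["from"]  # touching endpoints count as overlap

print(strict.tolist())     # [False, False, False]
print(inclusive.tolist())  # [False, True, False]
```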
      

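      【Discussion】:

        【Solution 4】:

        You could sort on the from column and then simply check whether each from overlaps with the previous to, using a rolling apply, which is very efficient.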

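        import numpy as np

        # Convert the datetimes to integer nanoseconds so rolling() can work on them
        df['from'] = pd.DatetimeIndex(df['from']).astype(np.int64)
        df['to'] = pd.DatetimeIndex(df['to']).astype(np.int64)

        sdf = df.sort_values(by='from')
        # raw=True passes a plain ndarray, so r[0] and r[1] are positional
        sdf[["from", "to"]].stack().rolling(window=2).apply(lambda r: 1 if r[1] >= r[0] else 0, raw=True).unstack()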
        

        Now the overlapping periods are the ones where from=0.0

           from   to
        0   NaN  1.0
        1   1.0  1.0
        2   1.0  1.0
        3   1.0  1.0
        4   0.0  1.0
        

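        【Discussion】:

          【Solution 5】:

          I have been browsing extensively ever since I ran into a problem similar to yours. I came across this solution. It uses the overlaps function from pandas, which is documented in detail here.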

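          import numpy as np
          import pandas as pd

          def function(df):
              # 'from' is a Python keyword, so the columns need bracket access
              timeintervals = pd.IntervalIndex.from_arrays(df['from'], df['to'], closed='both')
              index = np.arange(timeintervals.size)
              index_to_keep = []
              for intervals in timeintervals:
                  # Keep the first remaining interval, then drop everything that overlaps it
                  index_to_keep.append(index[0])
                  control = timeintervals[index].overlaps(timeintervals[index[0]])
                  if control.any():
                      index = index[~control]
                  else:
                      break
                  if index.size == 0:
                      break
              # Build the result only after the loop has finished
              temp = df.index[index_to_keep]
              output = df.loc[temp]
              return output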
          

          【Discussion】: