【Question title】:Pandas: Merge two dataframes on id and matching date range within +/- certain days
【Posted】:2021-05-28 10:57:40
【Question】:

Goal:

I want to merge two dataframes on a unique number and on dates that match within +/- 7 days.

Data:

df1

Number  Report         DateDone
1       some words     13/1/2021
1       more stuff     21/8/2021
44      balbla         11/4/2020
2       gobbledy bla   01/03/2019
44      rara rasputin  13/10/2021
44      tree frogs     11/10/2010

df2

Number  Report            DateDone
1       hocum poklum      11/1/2021
1       mjimmeny cricket  21/8/2021
44      it wasnt me       11/2/2020
2       its not really    6/03/2019
44      im innocent       12/10/2021
44      bullfrogs         11/01/2010

Expected result

Number.df1  Report.df1     DateDone.df1  Number.df2  Report.df2        DateDone.df2
1           some words     13/1/2021     1           hocum poklum      11/1/2021
1           more stuff     21/8/2021     1           mjimmeny cricket  21/8/2021
2           gobbledy bla   01/03/2019    2           its not really    6/03/2019
44          rara rasputin  13/10/2021    44          im innocent       12/10/2021

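I was planning to use a SQL-style merge similar to the one I found here, but I am struggling to work out how to join on both the number and the date range. Do I need to compute the dates 7 days before and after DateDone in df1? Surely there is a more efficient way than first computing two new columns?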

qry = '''
    select  
        df1.DateDone_start TermStart,
        df1.DateDone_end TermEnd,
        df2.DateDone df2Start,
        df1.Number,
        df2.Number
    from
        df1 join df2 on
        date between df1.DateDone_start and df1.DateDone_end join df1 on
        df1.Number = df2.Number
    '''
df = pd.read_sql_query(qry, conn)
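
For comparison, the SQL-style idea can be written without the two extra columns. The following is only a minimal sketch under assumptions that are not in the question: it uses a throwaway in-memory SQLite database, stores the dates as ISO text, and relies on SQLite's `julianday()` to measure the gap in days. The table aliases `a` and `b` and the output column names are illustrative.

```python
import sqlite3

import pandas as pd

# Sample data from the question; dates are day-first strings.
df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})

conn = sqlite3.connect(":memory:")  # assumption: temporary in-memory database
for name, frame in (("df1", df1), ("df2", df2)):
    out = frame.copy()
    # Store dates as ISO text (YYYY-MM-DD) so SQLite's julianday() can parse them.
    parsed = pd.to_datetime(out["DateDone"], format="%d/%m/%Y")
    out["DateDone"] = parsed.dt.strftime("%Y-%m-%d")
    out.to_sql(name, conn, index=False)

qry = """
    SELECT a.Number,
           a.Report   AS Report_df1,
           a.DateDone AS DateDone_df1,
           b.Report   AS Report_df2,
           b.DateDone AS DateDone_df2
    FROM df1 AS a
    JOIN df2 AS b
      ON a.Number = b.Number
     AND abs(julianday(a.DateDone) - julianday(b.DateDone)) <= 7
"""
sql_result = pd.read_sql_query(qry, conn)
print(sql_result)
```

Putting both conditions in a single `ON` clause avoids the second join on df1 and the precomputed start/end columns.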

【Comments】:

    Tags: python sql pandas


    【Solution 1】:

    You can use .merge() on Number, then use .loc to keep the rows where DateDone.df2 is .between() DateDone.df1 - 7 days and DateDone.df1 + 7 days, building the bounds with pd.DateOffset(days=7):

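    import pandas as pd

    # Parse the day-first date strings into datetimes
    df1['DateDone'] = pd.to_datetime(df1['DateDone'], dayfirst=True)
    df2['DateDone'] = pd.to_datetime(df2['DateDone'], dayfirst=True)

    # Pair every df1 row with every df2 row that shares the same Number
    df_merge = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))

    # Keep only pairs where DateDone.df2 falls within DateDone.df1 +/- 7 days
    result = df_merge.loc[
        df_merge['DateDone.df2'].between(
            df_merge['DateDone.df1'] - pd.DateOffset(days=7),
            df_merge['DateDone.df1'] + pd.DateOffset(days=7))]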
    

    Result:

    print(result)

        Number     Report.df1 DateDone.df1        Report.df2 DateDone.df2
    0        1     some words   2021-01-13      hocum poklum   2021-01-11
    3        1     more stuff   2021-08-21  mjimmeny cricket   2021-08-21
    8       44  rara rasputin   2021-10-13       im innocent   2021-10-12
    13       2   gobbledy bla   2019-03-01    its not really   2019-03-06
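
To try this approach end to end, here is a self-contained sketch that rebuilds the question's sample data and applies the same merge-then-`between` filter. The variable names `merged`, `window` and `in_window` are illustrative, and `pd.Timedelta` is used for the fixed 7-day window in place of `pd.DateOffset`. The index is reset at the end because the row labels left over from the merge can differ between pandas versions.

```python
import pandas as pd

# Sample data from the question; dates are day-first strings.
df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})
for frame in (df1, df2):
    frame["DateDone"] = pd.to_datetime(frame["DateDone"], format="%d/%m/%Y")

# Every df1 row is paired with every df2 row that has the same Number.
merged = df1.merge(df2, on="Number", suffixes=(".df1", ".df2"))

# Keep pairs whose df2 date lies inside the df1 date +/- 7 days (inclusive).
window = pd.Timedelta(days=7)
in_window = merged["DateDone.df2"].between(
    merged["DateDone.df1"] - window,
    merged["DateDone.df1"] + window,
)
result = merged.loc[in_window].reset_index(drop=True)
print(result)
```

Note that the merge materialises every pair per Number before filtering (14 rows for this sample), so memory use grows quickly when a key repeats many times in both frames.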
    

    【Comments】:

      【Solution 2】:

      Try merge, then keep only the rows whose two dates are within 7 days of each other:

      new_df = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
      new_df = new_df[
          abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
          ]
      

      new_df:

          Number     Report.df1 DateDone.df1        Report.df2 DateDone.df2
      0        1     some words   2021-01-13      hocum poklum   2021-01-11
      3        1     more stuff   2021-08-21  mjimmeny cricket   2021-08-21
      8       44  rara rasputin   2021-10-13       im innocent   2021-10-12
      13       2   gobbledy bla   2019-03-01    its not really   2019-03-06
      

      If you have not already done so, convert 'DateDone' in both frames to datetime before merging:

      df1['DateDone'] = pd.to_datetime(df1['DateDone'], format='%d/%m/%Y')
      df2['DateDone'] = pd.to_datetime(df2['DateDone'], format='%d/%m/%Y')
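
The day-first handling matters for this data. As a small illustration (the variable names are made up for this sketch), the value 01/03/2019 from df1 parses to a different date depending on whether pandas is told the day comes first:

```python
import pandas as pd

ambiguous = "01/03/2019"  # value taken from df1 in the question

month_first = pd.to_datetime(ambiguous)                   # default: month first
day_first = pd.to_datetime(ambiguous, dayfirst=True)      # day first
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")   # explicit format

print(month_first.date(), day_first.date(), explicit.date())
# 2019-01-03 2019-03-01 2019-03-01
```

An explicit `format` makes the intent unambiguous, which is why either `dayfirst=True` or `format='%d/%m/%Y'` is needed here.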
      

      Get the duration between the two datetimes:

      new_df['DateDone.df1'] - new_df['DateDone.df2']
      
      0        2 days
      1     -220 days
      2      222 days
      3        0 days
      4       60 days
      5     -549 days
      6     3743 days
      7      610 days
      8        1 days
      9     4293 days
      10   -3410 days
      11   -4019 days
      12     273 days
      13      -5 days
      dtype: timedelta64[ns]
      

      Apply abs to drop the sign of the duration, then compare it with the allowed duration:

      abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
      

      Use the resulting boolean mask to decide which rows to keep:

      0      True
      1     False
      2     False
      3      True
      4     False
      5     False
      6     False
      7     False
      8      True
      9     False
      10    False
      11    False
      12    False
      13     True
      dtype: bool
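
The steps above can be bundled into a small reusable function. This is only a sketch: `merge_within_days` and its parameters are hypothetical names introduced here, and it assumes both frames use the same key and date column names and that the date columns are already datetimes.

```python
import pandas as pd


def merge_within_days(left, right, key, date_col, days, suffixes=(".df1", ".df2")):
    """Hypothetical helper: pair rows sharing `key` whose dates differ by at most `days`."""
    merged = left.merge(right, on=key, suffixes=suffixes)
    # Absolute gap between the two date columns, regardless of which is earlier.
    gap = (merged[date_col + suffixes[0]] - merged[date_col + suffixes[1]]).abs()
    return merged[gap <= pd.Timedelta(days=days)].reset_index(drop=True)


# A reduced version of the question's data (Numbers 1 and 2 only).
left = pd.DataFrame({
    "Number": [1, 1, 2],
    "Report": ["some words", "more stuff", "gobbledy bla"],
    "DateDone": pd.to_datetime(["13/1/2021", "21/8/2021", "01/03/2019"],
                               format="%d/%m/%Y"),
})
right = pd.DataFrame({
    "Number": [1, 1, 2],
    "Report": ["hocum poklum", "mjimmeny cricket", "its not really"],
    "DateDone": pd.to_datetime(["11/1/2021", "21/8/2021", "6/03/2019"],
                               format="%d/%m/%Y"),
})

within_week = merge_within_days(left, right, "Number", "DateDone", days=7)
within_two_days = merge_within_days(left, right, "Number", "DateDone", days=2)
print(within_week)
print(within_two_days)
```

With a 7-day window all three pairs from the expected result for Numbers 1 and 2 are kept; narrowing it to 2 days drops the pair that is 5 days apart.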
      

      【Comments】:
