【Question title】: Pandas joining dataframes with different intervals in matching columns
【Posted】: 2025-11-29 06:30:01
【Question】:

I'm a bit unsure how to phrase this question properly. I have two Pandas dataframes:

import pandas as pd

data = {'ID':['A1','A1','A2','A2','A2'], 'FROM':[0,2,0,2,4], 'TO':[2,4,2,4,6], 'PYR' : [0.25,0.11,0.05,0,0.5]}

df = pd.DataFrame(data, columns = ['ID', 'FROM', 'TO', 'PYR'])

So df looks like this:

   ID  FROM  TO   PYR
0  A1     0   2  0.25
1  A1     2   4  0.11
2  A2     0   2  0.05
3  A2     2   4  0.00
4  A2     4   6  0.50

The second one:

new_data = {'ID':['A1','A2','A2'], 'FROM':[0, 0, 3.5], 'TO':[4, 3.5, 6], 'STRAT':['TD3', 'J1','J2']}

df2 = pd.DataFrame(new_data, columns = ['ID', 'FROM', 'TO', 'STRAT'])

   ID  FROM   TO STRAT
0  A1   0.0  4.0   TD3
1  A2   0.0  3.5    J1
2  A2   3.5  6.0    J2

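What I want to do is add STRAT from the second dataframe to the first. For each ID, both dataframes cover the same overall range, but the individual intervals differ.

I want to fill in STRAT so that a value is assigned to an interval of the first dataframe when it overlaps more than 50% of that interval, so the expected result looks like this: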

   ID  FROM  TO   PYR STRAT
0  A1     0   2  0.25   TD3
1  A1     2   4  0.11   TD3
2  A2     0   2  0.05    J1
3  A2     2   4  0.00    J1
4  A2     4   6  0.50    J2

I'm not sure how to approach this. I'd appreciate it if someone could point me in the right direction. Thanks!

【Comments】:

    Tags: python pandas dataframe join


    【Solution 1】:

    One approach: merge the dataframes on ID, then filter out the rows that fall outside the expected interval. That would be:

    # Merge
    df = df.merge(df2, on='ID', suffixes=('_1', '_2'))
    
    # Calculate interval overlap
    amount_overlap = (df[['TO_1', 'TO_2']].min(axis=1) -
        df[['FROM_2', 'FROM_1']].max(axis=1))
    
    # Keep only rows where the overlap exceeds 50% of the df TO-FROM interval
    df = df[(amount_overlap)/(df.TO_1 - df.FROM_1) > 0.5]
    

    If necessary, you can restore the column names:

    df = df.rename(columns={'TO_1':'TO', 'FROM_1': 'FROM'})
    

    And drop the unneeded columns:

    df = df.drop(['TO_2', 'FROM_2'], axis=1)
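
    The three steps above (merge, filter, clean up) can be collected into one self-contained sketch and run against the sample data from the question. The function name `assign_by_overlap` and its `threshold` parameter are illustrative and not part of the original answer.

```python
import pandas as pd

def assign_by_overlap(left, right, threshold=0.5):
    """Hypothetical helper: keep (left, right) pairs whose overlap exceeds
    `threshold` as a fraction of the left interval's length."""
    merged = left.merge(right, on='ID', suffixes=('_1', '_2'))
    # Overlap length: smaller right end minus larger left end
    # (negative when the two intervals are disjoint)
    overlap = (merged[['TO_1', 'TO_2']].min(axis=1)
               - merged[['FROM_1', 'FROM_2']].max(axis=1))
    keep = overlap / (merged['TO_1'] - merged['FROM_1']) > threshold
    return (merged[keep]
            .rename(columns={'FROM_1': 'FROM', 'TO_1': 'TO'})
            .drop(columns=['FROM_2', 'TO_2'])
            .reset_index(drop=True))

df = pd.DataFrame({'ID': ['A1', 'A1', 'A2', 'A2', 'A2'],
                   'FROM': [0, 2, 0, 2, 4],
                   'TO': [2, 4, 2, 4, 6],
                   'PYR': [0.25, 0.11, 0.05, 0, 0.5]})
df2 = pd.DataFrame({'ID': ['A1', 'A2', 'A2'],
                    'FROM': [0, 0, 3.5],
                    'TO': [4, 3.5, 6],
                    'STRAT': ['TD3', 'J1', 'J2']})

result = assign_by_overlap(df, df2)
print(result)
```

    Because the comparison is strict (`> threshold`), a row of `df` that no single `df2` interval covers by more than half is dropped from the result rather than kept with a missing STRAT.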
    

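    【Comments】:

    • This gives the wrong answer when the two intervals don't overlap, e.g. FROM_1 < TO_1 < FROM_2 < TO_2
    • I don't think so. In that case the amount_overlap variable would be negative, which is less than 0.5.
    • My mistake, your solution does seem to solve OP's problem for the 0.5 case
    • Thanks, everyone. I tested both versions and they give the same result on my real data. I'll accept Bernardo's answer because it is easier for me to understand.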
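
    The disjoint case raised in the comments above can be checked directly. This is a minimal sketch using made-up one-row frames (the ID `X` and the interval values are assumptions for illustration only), with the same overlap expression as Solution 1.

```python
import pandas as pd

# Hypothetical data: the left interval [0, 1] lies entirely before
# the right interval [2, 3], i.e. FROM_1 < TO_1 < FROM_2 < TO_2
left = pd.DataFrame({'ID': ['X'], 'FROM': [0], 'TO': [1]})
right = pd.DataFrame({'ID': ['X'], 'FROM': [2], 'TO': [3], 'STRAT': ['S']})

merged = left.merge(right, on='ID', suffixes=('_1', '_2'))

# Same overlap expression as in Solution 1
amount_overlap = (merged[['TO_1', 'TO_2']].min(axis=1)
                  - merged[['FROM_2', 'FROM_1']].max(axis=1))
ratio = amount_overlap / (merged['TO_1'] - merged['FROM_1'])

# The overlap is negative here, so the row fails the > 0.5 test
kept = merged[ratio > 0.5]
print(amount_overlap.iloc[0], len(kept))
```

    Here the overlap evaluates to min(1, 3) - max(2, 0) = -1, so the pair is filtered out, as the second comment says.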
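    【Solution 2】:

    I would do a cross join within each ID, keep only the valid pairs (those whose FROM-TO ranges overlap), then group by ID, FROM, TO and take the row with the largest overlap.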

    import numpy as np

    new_df = (df.merge(df2, on='ID', suffixes=['','_tmp'])
               .query('(FROM_tmp <= FROM & TO <= TO_tmp) | \
                       (FROM <= FROM_tmp <= TO) | \
                       (FROM <= TO_tmp <= TO)'
                     )
    )
    s1 = (new_df['FROM_tmp'].le(new_df['FROM']) &
          new_df['TO'].le( new_df['TO_tmp'])
         )
    s2 = (new_df['FROM_tmp'].ge(new_df['FROM']) &
          new_df['FROM_tmp'].le( new_df['TO'])
         )
    new_df['overlap'] = np.select((s1,s2),
                                  (new_df['TO_tmp'] - new_df['FROM_tmp'],
                                   new_df['TO'] - new_df['FROM_tmp']),
                                   new_df['TO_tmp'] - new_df['FROM']                            
                                 )
    
    # output
    (new_df.loc[new_df.groupby(['ID','FROM', 'TO'])
                   .overlap.idxmax()]
         .drop(['FROM_tmp', 'TO_tmp', 'overlap'], axis=1)
    )
    

    Output:

       ID  FROM  TO   PYR STRAT
    0  A1     0   2  0.25   TD3
    1  A1     2   4  0.11   TD3
    2  A2     0   2  0.05    J1
    4  A2     2   4  0.00    J1
    7  A2     4   6  0.50    J2
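
    As a possible simplification, the overlap length can be computed in one expression instead of branching with `np.select`. Note that in the code above, the first branch (a `df` interval lying fully inside a `df2` interval) records the length of the `df2` interval rather than the true overlap; this does not change the result for the sample data. The sketch below is a variant under that observation, not the answerer's own code, and the names `pairs` and `best` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A1', 'A2', 'A2', 'A2'],
                   'FROM': [0, 2, 0, 2, 4],
                   'TO': [2, 4, 2, 4, 6],
                   'PYR': [0.25, 0.11, 0.05, 0, 0.5]})
df2 = pd.DataFrame({'ID': ['A1', 'A2', 'A2'],
                    'FROM': [0, 0, 3.5],
                    'TO': [4, 3.5, 6],
                    'STRAT': ['TD3', 'J1', 'J2']})

# Pair every df row with every df2 row that shares its ID
pairs = df.merge(df2, on='ID', suffixes=('', '_tmp'))

# Signed overlap length: smaller right end minus larger left end;
# zero or negative means the two intervals do not overlap
pairs['overlap'] = (np.minimum(pairs['TO'], pairs['TO_tmp'])
                    - np.maximum(pairs['FROM'], pairs['FROM_tmp']))

# For each df interval, keep the df2 row with the largest overlap
best = pairs.loc[pairs.groupby(['ID', 'FROM', 'TO'])['overlap'].idxmax()]
result = best.drop(columns=['FROM_tmp', 'TO_tmp', 'overlap'])
print(result)
```

    Unlike Solution 1, this approach applies no 50% threshold: every `df` interval receives the STRAT with the largest overlap, even when that overlap is small.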
    

    【Comments】: