【Question title】: Join two pandas dataframes based on list columns
【Posted】: 2021-02-05 19:25:55
【Question】:

I have 2 dataframes, each with a column that contains lists.
I want to join them on rows whose lists share 2 or more values. Example:

ColumnA ColumnB        | ColumnA ColumnB        
id1     ['a','b','c']  | id3     ['a','b','c','x','y', 'z']
id2     ['a','d','e']  | 

Here id1 matches id3 because their lists share 2 or more values. So the output would be (the column names don't matter, they are only an example):

    ColumnA1 ColumnB1     ColumnA2   ColumnB2        
    id1      ['a','b','c']  id3     ['a','b','c','x','y', 'z']
    

How can I get this result? I tried iterating over every row of dataframe #1, but that doesn't seem like a good idea.
Thanks!

【Question comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Take the Cartesian product of the rows and check each pair.

    The explanation is inline in the code.

    import pandas as pd

    df1 = pd.DataFrame(
        {
            'ColumnA': ['id1', 'id2'],
            'ColumnB': [['a','b','c'], ['a','d','e']],
        }
    )
    
    df2 = pd.DataFrame(
        {
            'ColumnA': ['id3'],
            'ColumnB': [['a','b','c','x','y', 'z']],
        }
    )
    
    # Take the Cartesian product of both dataframes via a temporary constant key
    df1['k'] = 0
    df2['k'] = 0
    df = pd.merge(df1, df2, on='k').drop(columns='k')
    # Count the values shared by the two lists in each row
    df['overlap'] = df.apply(lambda x: len(set(x['ColumnB_x']).intersection(
                                       set(x['ColumnB_y']))), axis=1)
    # Keep the rows whose lists share 2 or more values
    df = df[df['overlap'] >= 2]
    print(df)
    

    Output:

      ColumnA_x  ColumnB_x ColumnA_y           ColumnB_y  overlap
    0       id1  [a, b, c]       id3  [a, b, c, x, y, z]        3
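
The Cartesian product above has len(df1) * len(df2) rows, which can get large. As a rough sketch of a different technique (not the method used in this answer), the lists can be exploded to one row per value and joined on the value itself, so only pairs that share at least one value are ever produced. It assumes the list elements are hashable, that the ids in ColumnA are unique, and that the lists contain no missing values (those are dropped here).

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA': ['id1', 'id2'],
                    'ColumnB': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA': ['id3'],
                    'ColumnB': [['a', 'b', 'c', 'x', 'y', 'z']]})

# One row per (id, value). Empty lists explode to NaN, so drop those rows,
# and drop duplicates so a value repeated inside one list counts once.
long1 = df1.explode('ColumnB').dropna(subset=['ColumnB']).drop_duplicates()
long2 = df2.explode('ColumnB').dropna(subset=['ColumnB']).drop_duplicates()

# Join on the value, then count how many values each id pair shares.
shared = (long1.merge(long2, on='ColumnB', suffixes=('_x', '_y'))
               .groupby(['ColumnA_x', 'ColumnA_y'])
               .size()
               .reset_index(name='overlap'))

# Keep the id pairs that share 2 or more values.
matches = shared[shared['overlap'] >= 2]
print(matches)
```

The result holds only the id pairs and their overlap count; the original list columns can be merged back on the ids if they are needed.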
    

    【Comments】:

    • Exactly. Thanks!
    【Solution 2】:

    If you are using pandas 1.2.0 or newer (released December 26, 2020), the Cartesian product (cross join) can be simplified as follows:

        df = df1.merge(df2, how='cross')         # simplified cross join for pandas >= 1.2.0
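
To make the equivalence concrete, here is a small sketch that builds the product both ways and compares the rows. The variable names via_key and via_cross are illustrative only, and the sample data is copied from this thread.

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z']]})

# Older approach: merge on a temporary constant key, then drop it.
# assign() returns copies, so df1 and df2 are left unmodified.
via_key = (df1.assign(k=0)
              .merge(df2.assign(k=0), on='k')
              .drop(columns='k'))

# pandas >= 1.2.0: request the cross join directly.
via_cross = df1.merge(df2, how='cross')

# Compare row by row as plain Python records.
same_rows = via_key.to_dict('records') == via_cross.to_dict('records')
print(same_rows)
```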
    

    Also, if execution time matters, consider list(map(...)) instead of the slower apply(..., axis=1).

    Using apply(..., axis=1):

    %%timeit
    df['overlap'] = df.apply(lambda x: 
                             len(set(x['ColumnB1']).intersection(
                                 set(x['ColumnB2']))), axis=1)
    
    
    800 µs ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Using list(map(...)):

    %%timeit
    df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
    
    217 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Note that list(map(...)) is roughly 3.7 times faster in this measurement (217 µs versus 800 µs)!
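
The %%timeit cells above only run in IPython or Jupyter. As a hedged sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The helper names overlap_apply and overlap_map are made up for this example, and the timings will differ by machine, pandas version and data size.

```python
import timeit

import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'],
                                 ['d', 'e', 'f', 'p', 'q', 'r']]})
df = df1.merge(df2, how='cross')  # requires pandas >= 1.2.0


def overlap_apply(frame):
    # Row-wise apply: pandas builds a Series object for every row.
    return frame.apply(
        lambda row: len(set(row['ColumnB1']) & set(row['ColumnB2'])), axis=1
    ).tolist()


def overlap_map(frame):
    # Plain Python iteration over the two columns in parallel.
    return list(map(lambda x, y: len(set(x) & set(y)),
                    frame['ColumnB1'], frame['ColumnB2']))


# Both variants must agree before their speed is compared.
assert overlap_apply(df) == overlap_map(df)

runs = 200
for fn in (overlap_apply, overlap_map):
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1e6:.0f} µs per call')
```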

    The full code, for reference:

        import pandas as pd

        data = {'ColumnA1': ['id1', 'id2'], 'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
        df1 = pd.DataFrame(data)
    
        data = {'ColumnA2': ['id3', 'id4'], 'ColumnB2': [['a','b','c','x','y', 'z'], ['d','e','f','p','q', 'r']]}
        df2 = pd.DataFrame(data)
    
        df = df1.merge(df2, how='cross')             # for pandas version >= 1.2.0
    
        df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
    
        df = df[df['overlap'] >= 2]
        print(df)
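
If the join is needed in more than one place, the steps above could be wrapped in a small helper. This is only a sketch: the function name join_on_shared_values and its min_shared parameter are hypothetical, and it assumes the two list columns have different names (otherwise merge would rename them with _x/_y suffixes).

```python
import pandas as pd


def join_on_shared_values(left, right, left_col, right_col, min_shared=2):
    """Cross join two frames and keep pairs whose lists share enough values."""
    pairs = left.merge(right, how='cross')  # requires pandas >= 1.2.0
    # Count the distinct values the two lists have in common, pair by pair.
    pairs['overlap'] = [
        len(set(a) & set(b))
        for a, b in zip(pairs[left_col], pairs[right_col])
    ]
    return pairs[pairs['overlap'] >= min_shared].reset_index(drop=True)


df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'],
                                 ['d', 'e', 'f', 'p', 'q', 'r']]})

result = join_on_shared_values(df1, df2, 'ColumnB1', 'ColumnB2')
print(result[['ColumnA1', 'ColumnA2', 'overlap']])
```

With this sample data, id1 pairs with id3 (3 shared values) and id2 pairs with id4 (2 shared values).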
    

    【Comments】:
