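【Question title】: Full outer join of two or more data frames
【Posted】: 2018-04-12 11:57:56
【Question】:

Given the following three Pandas data frames, I need to merge them in a way similar to a SQL full outer join. Note that the key is a composite of two columns, type_N and id_N, with N = 1, 2, 3: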

import pandas as pd

raw_data = {
        'type_1': [0, 1, 1,1],
        'id_1': ['3', '4', '5','5'],
        'name_1': ['Alex', 'Amy', 'Allen', 'Jane']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'id_1', 'name_1' ])

raw_datab = {
        'type_2': [1, 1, 1, 0],
        'id_2': ['4', '5', '5', '7'],
        'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'id_2', 'name_2'])

raw_datac = {
        'type_3': [1, 0],
        'id_3': ['4', '7'],
        'name_3': ['School', 'White']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'id_3', 'name_3'])

The expected result is:

type_1   id_1   name_1   type_2   id_2   name_2   type_3   id_3   name_3
0        3      Alex     NaN      NaN    NaN      NaN      NaN    NaN
1        4      Amy      1        4      Bill     1        4      School
1        5      Allen    1        5      Brian    NaN      NaN    NaN
1        5      Allen    1        5      Joe      NaN      NaN    NaN
1        5      Jane     1        5      Brian    NaN      NaN    NaN
1        5      Jane     1        5      Joe      NaN      NaN    NaN
NaN      NaN    NaN      0        7      Bryce    0        7      White

How can this be achieved in Pandas?

【Question comments】:

    Tags: python python-3.x pandas


    【Solution 1】:

    I suggest making life less complicated by not using different names for the things you want to merge on.

    da = df_a.set_index(['type_1', 'id_1']).rename_axis(['type', 'id'])
    db = df_b.set_index(['type_2', 'id_2']).rename_axis(['type', 'id'])
    dc = df_c.set_index(['type_3', 'id_3']).rename_axis(['type', 'id'])
    
    da.join(db, how='outer').join(dc, how='outer')
    
            name_1 name_2  name_3
    type id                      
    0    3    Alex    NaN     NaN
         7     NaN  Bryce   White
    1    4     Amy   Bill  School
         5   Allen  Brian     NaN
         5   Allen    Joe     NaN
         5    Jane  Brian     NaN
         5    Jane    Joe     NaN
    

    Here is a rather ugly way to get the other columns back:

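    from cytoolz.dicttoolz import merge
    
    # d is the outer-joined frame shown above
    d = da.join(db, how='outer').join(dc, how='outer')
    # one column per index level (type, id), aligned to d's index
    i = pd.DataFrame(d.index.values.tolist(), d.index, d.index.names)
    # blank out the key wherever frame j contributed no row, then add the _j suffix
    d = d.assign(**merge(
        i.mask(d[f'name_{j}'].isna()).add_suffix(f'_{j}').to_dict('list')
        for j in [1, 2, 3]
    ))
    
    # order the columns by suffix number first, then by name
    d[sorted(d.columns, key=lambda x: x.split('_')[::-1])]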
    
            id_1 name_1  type_1 id_2 name_2  type_2 id_3  name_3  type_3
    type id                                                             
    0    3     3   Alex     0.0  NaN    NaN     NaN  NaN     NaN     NaN
         7   NaN    NaN     NaN    7  Bryce     0.0    7   White     0.0
    1    4     4    Amy     1.0    4   Bill     1.0    4  School     1.0
         5     5  Allen     1.0    5  Brian     1.0  NaN     NaN     NaN
         5     5  Allen     1.0    5    Joe     1.0  NaN     NaN     NaN
         5     5   Jane     1.0    5  Brian     1.0  NaN     NaN     NaN
         5     5   Jane     1.0    5    Joe     1.0  NaN     NaN     NaN
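
A possibly simpler route to the same nine columns, sketched here as an assumption rather than as part of the original answer: `set_index(..., drop=False)` keeps `type_N` and `id_N` as ordinary columns while also using them as the index, so the outer join carries them along without the extra step. The helper name `keyed` is made up for this illustration.

```python
import pandas as pd

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1], 'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0], 'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
df_c = pd.DataFrame({'type_3': [1, 0], 'id_3': ['4', '7'],
                     'name_3': ['School', 'White']})

def keyed(df, n):
    # Hypothetical helper: index by (type_n, id_n) under the shared names
    # 'type' and 'id', but keep the original key columns (drop=False).
    return (df.set_index([f'type_{n}', f'id_{n}'], drop=False)
              .rename_axis(['type', 'id']))

da, db, dc = keyed(df_a, 1), keyed(df_b, 2), keyed(df_c, 3)

# Outer-join on the shared index, then discard the helper index.
result = da.join(db, how='outer').join(dc, how='outer').reset_index(drop=True)
print(result)
```

Rows that a frame did not contribute to show NaN in that frame's `type_N`, `id_N` and `name_N` columns, which is the shape asked for in the question. The row order may differ from the expected table.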
    

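    【Comments】:

    • Yes, that is what I recommended to him in his previous question :-) left is easier
    • (-: I just changed it to outer. Still simpler, though.
    • I can use the same names in the index, but I still need the result to have the same number of columns as the expected result stated in my question
    【Solution 2】:

    You can use two consecutive merges, first between df_a and df_b, then with df_c: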

    In [49]: df_temp = df_a.merge(df_b, how='outer', left_on=['type_1', 'id_1'], right_on=['type_2', 'id_2'])
    
    In [50]: df_temp.merge(df_c, how='outer', left_on=['type_2', 'id_2'], right_on=['type_3', 'id_3'])
    Out[50]:
       type_1 id_1 name_1 type_2 id_2 name_2  type_3 id_3  name_3
    0     0.0    3   Alex    NaN  NaN    NaN     NaN  NaN     NaN
    1     1.0    4    Amy      1    4   Bill     1.0    4  School
    2     1.0    5  Allen      1    5  Brian     NaN  NaN     NaN
    3     1.0    5  Allen      1    5    Joe     NaN  NaN     NaN
    4     1.0    5   Jane      1    5  Brian     NaN  NaN     NaN
    5     1.0    5   Jane      1    5    Joe     NaN  NaN     NaN
    6     NaN  NaN    NaN      0    7  Bryce     0.0    7   White
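
One caveat with chaining on `type_2`/`id_2`: rows that exist only in `df_a` have NaN there, so a third frame whose key matches only `df_a` would not line up with them. The sketch below illustrates this with a made-up frame `df_c_alt` (the key `(0, '3')` and the name `'Green'` are invented for the example) and shows one possible workaround, coalescing the keys with `combine_first` before the second merge. The column names `key_type` and `key_id` are also assumptions.

```python
import pandas as pd

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1], 'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0], 'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
# Hypothetical third frame: its key exists in df_a but not in df_b.
df_c_alt = pd.DataFrame({'type_3': [0], 'id_3': ['3'], 'name_3': ['Green']})

df_temp = df_a.merge(df_b, how='outer',
                     left_on=['type_1', 'id_1'], right_on=['type_2', 'id_2'])

# Chained on type_2/id_2: Alex has NaN there, so 'Green' ends up on its own row.
chained = df_temp.merge(df_c_alt, how='outer',
                        left_on=['type_2', 'id_2'], right_on=['type_3', 'id_3'])

# Workaround sketch: take the key from df_a where present, else from df_b.
keyed = df_temp.assign(
    key_type=df_temp['type_1'].combine_first(df_temp['type_2']),
    key_id=df_temp['id_1'].combine_first(df_temp['id_2']),
)
coalesced = (keyed.merge(df_c_alt, how='outer',
                         left_on=['key_type', 'key_id'],
                         right_on=['type_3', 'id_3'])
                  .drop(columns=['key_type', 'key_id']))

print(len(chained), len(coalesced))
```

With the data from the question this difference does not show up, because every key in `df_c` also appears in `df_b`.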
    

    【Comments】:

      【Solution 3】:

      Let's try creating a new key for this; I am using reduce here:

      import functools
      dfs = [df_a, df_b, df_c]
      # build a tuple key from each frame's first two columns (type_N, id_N)
      dfs = [x.assign(key=list(zip(x.iloc[:, 0], x.iloc[:, 1]))) for x in dfs]
      merged_df = functools.reduce(lambda left, right: pd.merge(left, right, on='key', how='outer'), dfs)
      merged_df.drop(columns='key')
      Out[110]: 
         type_1 id_1 name_1  type_2 id_2 name_2  type_3 id_3  name_3
      0     0.0    3   Alex     NaN  NaN    NaN     NaN  NaN     NaN
      1     1.0    4    Amy     1.0    4   Bill     1.0    4  School
      2     1.0    5  Allen     1.0    5  Brian     NaN  NaN     NaN
      3     1.0    5  Allen     1.0    5    Joe     NaN  NaN     NaN
      4     1.0    5   Jane     1.0    5  Brian     NaN  NaN     NaN
      5     1.0    5   Jane     1.0    5    Joe     NaN  NaN     NaN
      6     NaN  NaN    NaN     0.0    7  Bryce     0.0    7   White
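
The same idea can be wrapped in a small function that takes the key columns by name instead of by position, which may be safer when the key columns are not the first two. This is only a sketch; the function name `outer_join_all` and the temporary column `_key` are invented for the illustration.

```python
import functools
import pandas as pd

def outer_join_all(frames, keys):
    """Full outer join of several frames.

    frames: list of DataFrames
    keys:   list of key-column lists, one per frame, in matching order
    """
    # Add a temporary tuple-valued key column to each frame.
    tagged = [
        df.assign(_key=list(zip(*(df[c] for c in cols))))
        for df, cols in zip(frames, keys)
    ]
    # Fold the frames together with successive outer merges on that key.
    merged = functools.reduce(
        lambda left, right: pd.merge(left, right, on='_key', how='outer'),
        tagged,
    )
    return merged.drop(columns='_key')

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1], 'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0], 'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
df_c = pd.DataFrame({'type_3': [1, 0], 'id_3': ['4', '7'],
                     'name_3': ['School', 'White']})

result = outer_join_all(
    [df_a, df_b, df_c],
    [['type_1', 'id_1'], ['type_2', 'id_2'], ['type_3', 'id_3']],
)
print(result)
```

Because the merge is on a single shared column, a key that appears in the first and third frames but not the second still lines up.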
      

      【Comments】:
