【Question title】: Pandas: how to merge two dataframes on a column by keeping the information of the first one?
【Posted】: 2019-03-31 07:31:18
【Question】:

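I have two dataframes, df1 and df2. df1 contains people's ages, while df2 contains people's sex. Not everyone in df1 is in df2, and not everyone in df2 is in df1.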

df1
     Name   Age 
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30

df2
     Name      Sex 
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

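I want to add the sex information to the people in df1, with NaN wherever df2 has no entry for that person. I tried df1 = pd.merge(df1, df2, on = 'Name', how = 'outer'), but that also keeps people who appear only in df2, which I do not want. The result I am after: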

df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Sample:

    import pandas as pd
    
    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    #print (df1)
    # keep an unmodified copy of df1 for the merge example further down
    df3 = df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F']})
    #print (df2)
    

    Use map with a Series created by set_index:

    df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
    print (df1)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    Alternative solution using merge with a left join:

    df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
    print (df)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
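
To see why how='outer' kept the unwanted rows while how='left' does not, one option is merge's indicator=True flag, which adds a _merge column recording where each row came from. A small sketch on the sample frames from the question (the names outer, left and only_in_df2 are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# Outer join: keeps every Name found in either frame.
outer = df1.merge(df2, on='Name', how='outer', indicator=True)

# Rows tagged 'right_only' exist only in df2; these are the unwanted ones.
only_in_df2 = outer.loc[outer['_merge'] == 'right_only', 'Name'].tolist()

# Left join: keeps exactly the rows of df1, in df1's order.
left = df1.merge(df2, on='Name', how='left', indicator=True)

print(sorted(only_in_df2))
print(left[['Name', '_merge']])
```

Rows marked left_only in the left join are the ones that end up with NaN in Sex.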
    

    If you need to match on multiple columns (e.g. Year and Code), use merge with a left join:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Year':[2000,2003,2003,2004,2007],
                        'Code':[1,2,3,4,4],
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Year  Code  Age
    0    Tom  2000     1   34
    1   Sara  2003     2   18
    2    Eva  2003     3   44
    3   Jack  2004     4   27
    4  Laura  2007     4   30
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F'],
                        'Year':[2001,2003,2003,2004,2007],
                        'Code':[1,2,3,5,3],
                        'Val':[21,34,23,44,67]})
    print (df2)
           Name Sex  Year  Code  Val
    0       Tom   M  2001     1   21
    1      Paul   M  2003     2   34
    2       Eva   F  2003     3   23
    3      Jack   M  2004     5   44
    4  Michelle   F  2007     3   67
    
    #merge on Year and Code, keeping all columns of df2
    df = df1.merge(df2, on=['Year','Code'], how='left')
    print (df)
      Name_x  Year  Code  Age Name_y  Sex   Val
    0    Tom  2000     1   34    NaN  NaN   NaN
    1   Sara  2003     2   18   Paul    M  34.0
    2    Eva  2003     3   44    Eva    F  23.0
    3   Jack  2004     4   27    NaN  NaN   NaN
    4  Laura  2007     4   30    NaN  NaN   NaN
    
    #select specific columns - the join columns (Year, Code) are always needed, plus the columns to append (Val)
    df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
    print (df)
        Name  Year  Code  Age   Val
    0    Tom  2000     1   34   NaN
    1   Sara  2003     2   18  34.0
    2    Eva  2003     3   44  23.0
    3   Jack  2004     4   27   NaN
    4  Laura  2007     4   30   NaN
    

    If map raises an error, the join column (here Name) contains duplicates:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Age
    0    Tom   34
    1   Sara   18
    2    Eva   44
    3   Jack   27
    4  Laura   30
    
    df3, df4 = df1.copy(), df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                        'Val': [1,2,3,4,5]})
    print (df2)
           Name  Val
    0       Tom    1 <-duplicated name Tom
    1       Tom    2 <-duplicated name Tom
    2       Eva    3
    3      Jack    4
    4  Michelle    5
    
    s = df2.set_index('Name')['Val']
    df1['New'] = df1['Name'].map(s)
    print (df1)
    

    InvalidIndexError: Reindexing only valid with uniquely valued Index objects
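
Before mapping, it can help to check whether the key column is unique at all. A minimal sketch using the duplicated df2 from above (the names names_unique and dupes are only illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# False means at least one Name occurs more than once.
names_unique = df2['Name'].is_unique

# keep=False marks every occurrence of a duplicated key, not only the repeats.
dupes = df2[df2['Name'].duplicated(keep=False)]

print(names_unique)
print(dupes)
```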

    The solution is to remove the duplicates with DataFrame.drop_duplicates, or to map with a dict, which keeps the last value for each duplicated key:

    #default keep first value
    s = df2.drop_duplicates('Name').set_index('Name')['Val']
    print (s)
    Name
    Tom         1
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df1['New'] = df1['Name'].map(s)
    print (df1)
        Name  Age  New
    0    Tom   34  1.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #add parameter for keep last value 
    s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
    print (s)
    Name
    Tom         2
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df3['New'] = df3['Name'].map(s)
    print (df3)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #map by dictionary
    d = dict(zip(df2['Name'], df2['Val']))
    print (d)
    {'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
    
    df4['New'] = df4['Name'].map(d)
    print (df4)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
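
If neither the first nor the last duplicate is the right choice, another option (not part of the answer above) is to aggregate the duplicates into one value per key before mapping. A sketch assuming the sum is the wanted aggregate; any other groupby reduction could be swapped in:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# Collapse duplicated names to a single value each.
# sum() is an arbitrary choice for this illustration.
s = df2.groupby('Name')['Val'].sum()

# The grouped Series has a unique index, so map no longer raises.
df1['New'] = df1['Name'].map(s)
print(df1)
```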
    

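    【Comments】:

    • Hi, how can I use df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex']) when the second dataframe has a different number of rows? I used it on my dataset and only get a result for the first row. Thanks.
    • @sygneto - It should work. Do the values match? What do print (df1['Sex'].unique()) and print (df2['Sex'].unique()) return?
    • I have all unique values, but in my case the column df1['sex'] already exists and holds the value 0 in every row. How would you replace it? Or should I drop this column before the map?
    • @sygneto - It is hard for me to see the problem, because I cannot see your data. :(
    • I think the reason is that I already have a ['sex'] column in both dataframes. How can I replace it or append to it?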
    【Solution 2】:

    You can also use the join method:

    df1.set_index("Name").join(df2.set_index("Name"), how="left")
    

    Edit: added set_index("Name")
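
Note that join matches on the index, so the result is indexed by Name rather than carrying it as a column. A short sketch on the sample data from the question, with reset_index added as an optional last step to get the original layout back:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# Both frames are indexed by Name; the left join keeps only df1's names.
joined = df1.set_index('Name').join(df2.set_index('Name'), how='left')

# Turn the Name index back into an ordinary column.
result = joined.reset_index()
print(result)
```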

    【Comments】:

      【Solution 3】:

      A simple addition to @jezrael's answer, for creating a dictionary from a dataframe.

      This might help.

      Python:

      df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                          'Age': [34, 18, 44, 27, 30]})
      
      
      df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Paul', 'Jack', 'Michelle', 'Tom'],
                          'Something': ['M', 'M', 'F', 'M', 'A', 'F', 'B']})
      
      
      df1_dict = pd.Series(df1.Age.values, index=df1.Name).to_dict()
      
      df2['Age'] = df2['Name'].map(df1_dict)
      
      print(df2)
      

      Output:

            Name Something   Age
      0       Tom         M  34.0
      1      Paul         M   NaN
      2       Eva         F  44.0
      3      Paul         M   NaN
      4      Jack         A  27.0
      5  Michelle         F   NaN
      6       Tom         B  34.0
      

      【Comments】:

        【Solution 4】:

        Reindexing has not been mentioned yet, but it is very fast and can auto-fill missing values if needed.


        DataFrame.reindex

        Use the common key (Name) as the index of the mapping dataframe (df2):

        • If df2's index is already Name, just reindex directly:

          df2['Sex'].reindex(df1['Name'])
          
        • Otherwise, call set_index first:

          df2.set_index('Name')['Sex'].reindex(df1['Name'])
          

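        Note that when assigning into an existing dataframe, the reindexed Series is indexed by Name and will not align with df1's index, so assign only the array values: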

        df1['Sex'] = df2.set_index('Name')['Sex'].reindex(df1['Name']).array
        
        #     Name  Age  Sex
        # 0    Tom   34    M
        # 1   Sara   18  NaN
        # 2    Eva   44    F
        # 3   Jack   27    M
        # 4  Laura   30  NaN
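
The misalignment is easy to reproduce. In the sketch below (wrong and right are illustrative copies of df1), assigning the reindexed Series directly aligns its Name labels against df1's integer index, finds no matches, and yields only NaN, while assigning the bare array works positionally:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# Indexed by Name ('Tom', 'Sara', ...), not by 0..4.
s = df2.set_index('Name')['Sex'].reindex(df1['Name'])

wrong = df1.copy()
wrong['Sex'] = s        # aligned by label: no label matches, so all NaN

right = df1.copy()
right['Sex'] = s.array  # assigned by position: values land in the right rows

print(wrong)
print(right)
```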
        

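        I have also noticed a common assumption that reindexing is slow, but it is actually fast(est); see the timing code at the end of this answer.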


        Filling missing values

        reindex supports auto-filling missing values:

        • fill_value: static replacement
        • method: algorithmic replacement (ffill, bfill, nearest), given a monotonic index
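
The two options behave quite differently. A minimal sketch with a made-up lookup Series (prices and days are hypothetical, chosen only because method needs a monotonic index, which the Name column is not):

```python
import pandas as pd

# Hypothetical lookup: a price that takes effect from a given day number.
prices = pd.Series([10.0, 12.5, 15.0], index=[0, 10, 20], name='Price')

days = [5, 10, 25, -1]

# fill_value: every key missing from the index gets the same constant.
static = prices.reindex(days, fill_value=0.0)

# method='ffill': each key takes the value of the closest earlier label;
# a key that precedes the whole index (-1) still comes back as NaN.
ffilled = prices.reindex(days, method='ffill')

print(static)
print(ffilled)
```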

        For example, to fill empty Sex values with Prefer not to say (PNS):

        df1['Sex'] = df2.set_index('Name')['Sex'].reindex(df1['Name'], fill_value='PNS').array
        
        #     Name  Age  Sex
        # 0    Tom   34    M
        # 1   Sara   18  PNS
        # 2    Eva   44    F
        # 3   Jack   27    M
        # 4  Laura   30  PNS
        

        Reindexing with fill_value is faster than chaining fillna.


        Handling duplicates

        The mapping dataframe (df2) cannot have duplicate keys, so apply drop_duplicates first if needed:

        df2.drop_duplicates('Name').set_index('Name')['Sex'].reindex(df1['Name'])
        

        Timing code:

        import numpy as np
        import pandas as pd
        
        n = 10_000  # assumption: the original snippet does not show n; use any size
        df1 = pd.DataFrame({'Name': np.arange(n), 'Age': np.random.randint(100, size=n)}).sample(frac=1).reset_index(drop=True)
        df2 = pd.DataFrame({'Name': np.arange(n) + int(n * 0.5), 'Sex': np.random.choice(list('MF'), size=n)}).sample(frac=1).reset_index(drop=True)
        
        def reindex_(df1, df2):
            df1['Sex'] = df2.set_index('Name')['Sex'].reindex(df1['Name']).array
            return df1
        
        def map_(df1, df2):
            df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
            return df1
        
        def dict_(df1, df2):
            df1['Sex'] = df1['Name'].map(dict(zip(df2['Name'], df2['Sex'])))
            return df1
        
        def merge_(df1, df2):
            return df1.merge(df2[['Name', 'Sex']], left_on='Name', right_on='Name', how='left')
        
        def join_(df1, df2):
            return df1.set_index('Name').join(df2.set_index('Name'), how='left').reset_index()
        
        reindex_fill_value_ = lambda df1, df2: df2.set_index('Name')['Sex'].reindex(df1['Name'], fill_value='PNTS')
        reindex_fillna_ = lambda df1, df2: df2.set_index('Name')['Sex'].reindex(df1['Name']).fillna('PNTS')
        map_fillna_ = lambda df1, df2: df1['Name'].map(df2.set_index('Name')['Sex']).fillna('PNTS')
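
The functions above only define the candidates; the harness that produced the timings is not shown. One way to time them is the standard-library timeit module. The sketch below is an assumption about how such a comparison could be run, not the original benchmark: the size n, the repeat counts and the trimmed-down candidate functions are all illustrative, and results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000  # assumed size, small enough to run quickly
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'Name': np.arange(n), 'Age': rng.integers(100, size=n)})
# Shift the keys so that only half of df1's names exist in df2.
df2 = pd.DataFrame({'Name': np.arange(n) + n // 2,
                    'Sex': rng.choice(list('MF'), size=n)})

def reindex_(df1, df2):
    return df2.set_index('Name')['Sex'].reindex(df1['Name']).array

def map_(df1, df2):
    return df1['Name'].map(df2.set_index('Name')['Sex'])

def merge_(df1, df2):
    return df1.merge(df2[['Name', 'Sex']], on='Name', how='left')

candidates = {'reindex': reindex_, 'map': map_, 'merge': merge_}

# Best of 3 runs, each run making 20 calls, to reduce noise.
timings = {
    name: min(timeit.repeat(lambda f=f: f(df1, df2), number=20, repeat=3))
    for name, f in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name:8s} {seconds:.4f}s for 20 calls')
```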

        【Comments】:
