【Question title】: Cross-checking and consolidating pandas DataFrames (pd.concat and pd.merge don't seem to work in this case)
【Posted】: 2021-02-13 20:02:14
【Question】:

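I am trying to combine two datasets that complement each other in places into a new DataFrame, without duplicating columns. In other words, I have two DataFrames. For some columns (which have the same name in both DataFrames), the value I need is in one DataFrame or the other, but never in both.

For example, see the dummy DataFrames below, which reflect the problem at hand. They contain information about the same 3 people. Note the "gender" column: when a value is missing in one DataFrame, it is found in the other, and vice versa. Combining the two columns gives a complete gender column. Ideally, I would end up with df_need below.

(The actual datasets have many columns like gender.)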

import numpy as np
import pandas as pd

df_have1 = pd.DataFrame({'age':[7,34,19], 'gender':['F',np.nan,'M'], 'profession':['student', 'CEO', 'artist']})
df_have1

df_have2 = pd.DataFrame({'age':[7,34,19], 'gender':['np.nan','F',np.nan], 'interests':['acting', 'cars', 'gardening']})
df_have2

df_need = pd.DataFrame({'age':[7,34,19], 'gender':['F','F','M'], 'profession':['student', 'CEO', 'artist'], 'interests':['acting', 'cars', 'gardening']})
df_need

I tried pd.concat, but unfortunately it duplicates the gender column. The same goes for pd.merge and join.

pd.concat([df_have1, df_have2], axis=1)


    Tags: python pandas validation merge concatenation


    【Solution 1】:
    • merge() or join() can be used together with column suffixes
    • use the data from the suffixed column in fillna()
    • drop the unneeded columns when done
    import numpy as np
    import pandas as pd

    df_have1 = pd.DataFrame({'age':[7,34,19], 'gender':['F',np.nan,'M'], 'profession':['student', 'CEO', 'artist']})
    
    df_have2 = pd.DataFrame({'age':[7,34,19], 'gender':['np.nan','F',np.nan], 'interests':['acting', 'cars', 'gardening']})
    
    df_need = (df_have1.join(df_have2, rsuffix="_r")
     .assign(gender=lambda dfa: dfa.gender.fillna(dfa.gender_r))
     .drop(columns=["age_r","gender_r"])
    )
    
    
       age gender profession  interests
    0    7      F    student     acting
    1   34      F        CEO       cars
    2   19      M     artist  gardening
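
The bullets above mention merge() as well as join(). A minimal sketch of the merge() variant follows. It assumes that age uniquely identifies each person in both frames (true only for this dummy data; a real dataset would need a proper key column) and that the missing values are real np.nan rather than strings.

```python
import numpy as np
import pandas as pd

df_have1 = pd.DataFrame({'age': [7, 34, 19],
                         'gender': ['F', np.nan, 'M'],
                         'profession': ['student', 'CEO', 'artist']})
df_have2 = pd.DataFrame({'age': [7, 34, 19],
                         'gender': [np.nan, 'F', np.nan],
                         'interests': ['acting', 'cars', 'gardening']})

# Assumption: 'age' works as a unique key here. how='left' keeps the row
# order of df_have1; the empty left suffix leaves its column names untouched.
merged = df_have1.merge(df_have2, on='age', how='left', suffixes=('', '_r'))

df_need = (merged
           .assign(gender=lambda d: d['gender'].fillna(d['gender_r']))
           .drop(columns=['gender_r']))
print(df_need)
```

Because age is the merge key, it is not duplicated, so only gender_r has to be dropped afterwards.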


      【Solution 2】:

      A more general version of the code, adapted from @Rob Raymond's suggestion.

      def replace_str_nan_by_np_nan(df_str_nan):
          """
              dealing with nan strings, since fillna handles only np.nan
              
              Args: df with string nan
              
              Return: df with np.nan
          
          Ex: 
              import pandas as pd
              import numpy as np
                      
              df_str_nan = pd.DataFrame({
                  'age':['np.nan',34,19], 
                  'gender':['Nan',np.nan,'M'], 
                  'profession':['student', 'nan', 'artist']})
              df_np_nan = replace_str_nan_by_np_nan(df_str_nan)              
              print(df_np_nan.isna())
                  age     gender  profession
              0   True    True    False
              1   False   True    True
              2   False   False   False
          """
          import numpy as np
          
          df_np_nan = df_str_nan.copy()
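          # Match whole cells only: with regex=True, any string that merely
          # contains "nan" (for example "finance") would also be replaced.
          df_np_nan = df_np_nan.replace(['np.nan', 'NaN', 'Nan', 'nan'], np.nan)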
              
          return df_np_nan
      
      
      def join_df1_df2_repeated_col(df1, df2):
          """
              join two dataframes keeping values within repeated columns 
              dealing with nan strings, since fillna handles only np.nan
              
              Args: df1, df2 two dataframes
              
              Return: df_join joined dataframe
          
          Ex: 
              import pandas as pd
              import numpy as np
              
      
              df1 = pd.DataFrame({
                  'age':[7,34,19], 
                  'gender':['F',np.nan,'M'], 
                  'profession':['student', 'CEO', 'artist']})
              df2 = pd.DataFrame({
                  'age':[7,34,19], 
                  'gender':['np.nan','F',np.nan], 
                  'interests':['acting', 'cars', 'gardening']})
      
              print(join_df1_df2_repeated_col(df1, df2))
              
                  age gender  profession  interests
              0   7   F       student     acting
              1   34  F       CEO         cars
              2   19  M       artist      gardening
          """
          import pandas as pd
          import numpy as np
          
          
          # dealing with nan strings, since fillna handles only np.nan
          df1 = replace_str_nan_by_np_nan(df1)
          df2 = replace_str_nan_by_np_nan(df2)
          
          rsuffix = "_r"
          df_join = df1.join(df2, rsuffix=rsuffix)
          
          # dealing with repeated columns
          mask = df_join.columns.str.endswith(rsuffix)
          lst_col_r = list(df_join.loc[:,mask].columns)
          for col_r in lst_col_r:
              col = col_r[:-len(rsuffix)]
              df_join[col] = df_join[col].fillna(df_join[col_r])   
          
          return df_join.drop(columns=lst_col_r)
      
      
      import pandas as pd
      import numpy as np
      
      df1 = pd.DataFrame({
          'age':[7,34,19], 
          'gender':['F',np.nan,'M'], 
          'profession':['student', 'CEO', 'artist']})
      df2 = pd.DataFrame({
          'age':[7,34,19], 
          'gender':['np.nan','F',np.nan], 
          'interests':['acting', 'cars', 'gardening']})
      
      join_df1_df2_repeated_col(df1, df2)
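
As a possible alternative that is not part of the answer above: when the missing values are real NaN and both frames share the same index, pandas' DataFrame.combine_first fills the gaps of one frame from the other for all shared columns at once. The sketch below assumes exactly those conditions. The cols list is a hypothetical helper that only restores the original column order.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'age': [7, 34, 19],
                    'gender': ['F', np.nan, 'M'],
                    'profession': ['student', 'CEO', 'artist']})
df2 = pd.DataFrame({'age': [7, 34, 19],
                    'gender': [np.nan, 'F', np.nan],
                    'interests': ['acting', 'cars', 'gardening']})

# Desired column order: all columns of df1, then the columns only df2 has.
cols = list(df1.columns) + [c for c in df2.columns if c not in df1.columns]

# combine_first keeps df1's values and takes df2's value wherever df1 is null.
# Rows are aligned on the index, so both frames must list people in the same order.
df_join = df1.combine_first(df2)[cols]
print(df_join)
```

String placeholders such as 'np.nan' are not treated as missing by combine_first, so they would have to be converted to real NaN first.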
      


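        【Solution 3】:

        @Rob Raymond's approach is by far the better one.

        However, if both DataFrames have the same number of rows, you can get a similar result with a dictionary and for loops (which is bad practice within the pandas framework).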

        import numpy as np
        import pandas as pd

        df_have1 = pd.DataFrame({
            'age':[7,34,19], 
            'gender':['F',np.nan,'M'], 
            'profession':['student', 'CEO', 'artist']})
        df_have2 = pd.DataFrame({
            'age':[7,34,19], 
            'gender':['np.nan','F',np.nan], 
            'interests':['acting', 'cars', 'gardening']})
        df_need = pd.DataFrame({
            'age':[7,34,19],
            'gender':['F','F','M'],
            'profession':['student', 'CEO', 'artist'],
            'interests':['acting', 'cars', 'gardening']})
        
        dct = {k:{} for k in (list(df_have1.columns) + list(df_have2.columns))}
        for col in dct.keys():
            if col in list(df_have1.columns):
                for row in df_have1.index:
                    if col in list(df_have2.columns):  # intersection
                        v1 = df_have1[col].iloc[row]
                        v2 = df_have2[col].iloc[row]
                        # pd.isna also catches NaN objects other than np.nan itself
                        if v1 != 'NaN' and not pd.isna(v1):
                            dct[col][row] = v1
                        elif v2 != 'NaN' and not pd.isna(v2):
                            dct[col][row] = v2
                        else:  # value is missing in both dataframes
                            dct[col][row] = np.nan
                    else:  # data only in df_have1
                        dct[col][row] = df_have1[col].iloc[row]
            else:  # data only in df_have2
                for row in df_have2.index:
                    dct[col][row] = df_have2[col].iloc[row]
        
        df_get = pd.DataFrame(dct)
        
        assert df_get.equals(df_need)  # confirms that both DataFrames are identical
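
A side note on detecting missing values in a loop like the one above: a membership test such as x in [np.nan] only succeeds when x is the very same object as np.nan, because NaN never compares equal to anything, itself included. The short sketch below compares that test with pd.isna. The values list is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Three different NaN objects plus one ordinary value (illustrative only).
values = [np.nan, float('nan'), np.float64('nan'), 'F']

# Membership falls back on identity for NaN, so only np.nan itself is found.
by_membership = [v in [np.nan] for v in values]

# pd.isna recognises every NaN, whichever object it happens to be.
by_isna = [bool(pd.isna(v)) for v in values]

print(by_membership)
print(by_isna)
```

Values read from a float column usually come back as np.float64 objects, which is the case where the membership test can miss a NaN.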
        
