Replace values in a data frame based on pivot table values: answers

【Question title】: Replace values in data frame based on pivot table values
【Posted】: 2020-11-22 15:09:50
【Question description】:

I want to replace the NaN values in the "Age" column of my data frame with the values given in a pivot table.

"0 is female, 1 is male"

Example of df

Pclass Gender Age
  3      1     22
  1      0     38
  2      1     27
  3      0    NaN

Pivot table
            Age
    Gender 0  1
    PClass 
    1     40  35
    2     30  28
    3     25  21

For example, if a person's age is missing and he/she is Pclass 3 and Gender 0, then the age should be 25.

I have about 100 rows to update. Is there a quick way to do this?

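【Question comments】:

Please provide your df as text (not an image). For example, you can use df.to_dict().
There are no NaNs in the Age column...
I just updated the text.

Tags: python pandas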

【Solution 1】:

I would convert the pivot table to a regular df:

pdf = pivot_table.stack().reset_index()

Then merge it with the rows of df whose Age is NaN, and use combine_first to fill the gaps:

nan_df = df.loc[df['Age'].isna(), ['Pclass', 'Gender']].merge(pdf, how='left')
df.set_index(['Pclass', 'Gender']).combine_first(nan_df.set_index(['Pclass', 'Gender'])).reset_index()

   Pclass  Gender   Age
0       1       0  38.0
1       2       1  27.0
2       3       0  25.0
3       3       1  22.0
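
The steps above can be put together as a self-contained sketch. It assumes the pivot table is a DataFrame indexed by `Pclass` with a two-level column header (`Age`, then `Gender`), and that the key column is spelled `Pclass` in both frames. The question only shows the pivot table as text, so the way `pivot_table` is built here is an illustration, not the asker's actual object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Gender": [1, 0, 1, 0],
    "Age": [22, 38, 27, np.nan],
})

# Assumed shape of the pivot table: index = Pclass, columns = (Age, Gender).
pivot_table = pd.DataFrame(
    [[40, 35], [30, 28], [25, 21]],
    index=pd.Index([1, 2, 3], name="Pclass"),
    columns=pd.MultiIndex.from_product([["Age"], [0, 1]], names=[None, "Gender"]),
)

# Long format: one row per (Pclass, Gender) pair with its Age.
pdf = pivot_table.stack().reset_index()

# Look up an age for every row whose Age is missing.
# With no `on` argument, merge joins on the shared columns (Pclass, Gender).
nan_df = df.loc[df["Age"].isna(), ["Pclass", "Gender"]].merge(pdf, how="left")

# combine_first keeps existing ages and fills the gaps from nan_df.
result = (
    df.set_index(["Pclass", "Gender"])
      .combine_first(nan_df.set_index(["Pclass", "Gender"]))
      .reset_index()
)
print(result)
```

Because `combine_first` aligns on the index, this only behaves well when each (Pclass, Gender) pair identifies a single row, as in the four-row sample. With repeated pairs, one of the merge-based approaches is likely a safer choice.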

【Comments】:

【Solution 2】:

You can first create the pivot_table, then merge it back into df and replace the values wherever a NaN is observed.

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['PClass','Gender','Age'])
df['PClass'] = [3,1,2,3]
df['Gender'] = [1,0,1,0]
df['Age'] = [22,38,27,np.nan]

df_pivot = pd.pivot_table(df,index=['PClass'],columns=['Gender'],values=['Age'],aggfunc='mean',fill_value=0) ### you can choose your own aggfunc
### I have taken `mean` here, but there are a bunch of available options

df_pivot = df_pivot.unstack().reset_index().rename(columns={0:'Avg_Age_Pivot'})

df = pd.merge(df,df_pivot[['PClass','Gender','Avg_Age_Pivot']],on=['PClass','Gender'])

def replace_na(inp):
    # inp is one row holding [Age, Avg_Age_Pivot]
    inp = inp.values
    if pd.isnull(inp[0]):
        return inp[1]
    return inp[0]


df['Age'] = df[['Age','Avg_Age_Pivot']].apply(replace_na,axis=1)

df_pivot output:

>>> pd.pivot_table(df,index=['PClass'],columns=['Gender'],values=['Age'],aggfunc='mean') ### you can choose your own aggfunc
         Age      
Gender     0     1
PClass            
1       38.0   NaN
2        NaN  27.0
3        NaN  22.0

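You can then decide whether to keep or drop the Avg_Age_Pivot column.

I also noticed that, with the amount of data you provided, the pivot_table contains NaN values, so you will not see the expected result with the current df values.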
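
The merge-then-replace idea can also be written without a row-wise `apply`. This is a minimal sketch, assuming the lookup ages are typed in from the question's pivot table rather than computed from the four sample rows. The frame name `lookup` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PClass": [3, 1, 2, 3],
    "Gender": [1, 0, 1, 0],
    "Age": [22, 38, 27, np.nan],
})

# Long-format lookup table, copied from the question's pivot table (assumption).
lookup = pd.DataFrame({
    "PClass": [1, 2, 3, 1, 2, 3],
    "Gender": [0, 0, 0, 1, 1, 1],
    "Avg_Age_Pivot": [40, 30, 25, 35, 28, 21],
})

# A left merge keeps every row of df, in its original order.
merged = df.merge(lookup, on=["PClass", "Gender"], how="left")

# fillna with a Series fills each NaN from the same row of the other column.
merged["Age"] = merged["Age"].fillna(merged["Avg_Age_Pivot"])
merged = merged.drop(columns="Avg_Age_Pivot")
print(merged)
```

Using `how="left"` means rows whose (PClass, Gender) pair is absent from the lookup are kept with their Age still NaN, instead of being dropped as they would be with the default inner join.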

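【Comments】:

Thank you very much! However, I don't quite understand what the "NaN" values in the pivot_table mean, since there are exactly 3x2 = 6 entries.
I added the df_pivot output to the answer. For pivot index combinations that do not exist in the data itself, aggfunc returns NaN.
Oh, I see! This isn't the actual data, so no worries. Thanks.

【Solution 3】:

See this approach. It creates a common column named new by combining the "PClass" and "Gender" columns, then uses map and df.fillna to replace the NaN values. I had to create this new column because I could only apply the map method to a pd.Series.

Input: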

import io
import pandas as pd
df1  = pd.read_csv(io.StringIO("""
PClass Gender Age
  3      1     22
  1      0     38
  2      1     27
  3      0    NaN
  """), sep=r"\s{1,}", engine="python") 

df2  = pd.read_csv(io.StringIO("""
PClass  Gender Age
    1     0  40
    2     0  30
    3     0  25
    1     1  35
    2     1  28
    3     1  21
  """), sep=r"\s{1,}", engine="python")

df1 (the actual df)

  PClass  Gender   Age
0       3       1  22.0
1       1       0  38.0
2       2       1  27.0
3       3       0   NaN

df2 (the pivot table)

  PClass  Gender  Age
0       1       0   40
1       2       0   30
2       3       0   25
3       1       1   35
4       2       1   28
5       3       1   21

Code:

df1['new'] = df1['PClass'].astype(str)+df1['Gender'].astype(str)
df2['new'] = df2['PClass'].astype(str)+df2['Gender'].astype(str)
fill = df2.set_index(['new'])['Age'].to_dict()
df1['Age'] = df1['Age'].fillna(df1['new'].map(fill))
df1 = df1.drop('new',axis=1)
print(df1)

Output:

   PClass  Gender   Age
0       3       1  22.0
1       1       0  38.0
2       2       1  27.0
3       3       0  25.0
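
Concatenating the two keys as strings works for single-digit codes like these, but it could become ambiguous with multi-digit values (for example, `1` + `11` and `11` + `1` both give `111`). A hedged alternative sketch, assuming the same `df1` and `df2` contents as above, looks each pair up through a `MultiIndex` instead. The names `lookup`, `keys` and `fill_values` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "PClass": [3, 1, 2, 3],
    "Gender": [1, 0, 1, 0],
    "Age": [22, 38, 27, np.nan],
})
df2 = pd.DataFrame({
    "PClass": [1, 2, 3, 1, 2, 3],
    "Gender": [0, 0, 0, 1, 1, 1],
    "Age": [40, 30, 25, 35, 28, 21],
})

# Series of ages indexed by the (PClass, Gender) pair.
lookup = df2.set_index(["PClass", "Gender"])["Age"]

# One (PClass, Gender) key per row of df1, in row order.
keys = pd.MultiIndex.from_frame(df1[["PClass", "Gender"]])

# reindex returns the lookup age for each key (NaN if the pair is absent).
# The values are re-labelled with df1's index so that fillna lines them up.
fill_values = pd.Series(lookup.reindex(keys).to_numpy(), index=df1.index)
df1["Age"] = df1["Age"].fillna(fill_values)
print(df1)
```

This avoids the temporary `new` column and keeps the keys as integers, at the cost of requiring each (PClass, Gender) pair to appear only once in `df2`.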

【Comments】: