How to impute NaN values based on values of other column? Answers

【Question title】: How to impute NaN values based on values of other column?
【Posted】: 2018-12-27 19:18:28
【Question description】:

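I have 2 columns in a dataframe:

1) Work experience (years)

2) company_type

I want to impute the company_type column based on the work experience column. The company_type column has NaN values that I want to fill in according to the work experience column. The work experience column has no missing values.

Here work_exp is numerical data and company_type is categorical data.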

Sample data:

Work_exp      company_type
   10            PvtLtd
   0.5           startup
   6           Public Sector
   8               NaN
   1             startup
   9              PvtLtd
   4               NaN
   3           Public Sector
   2             startup
   0               NaN

I have decided on thresholds for imputing the NaN values.

Startup if work_exp < 2yrs
Public sector if work_exp > 2yrs and <8yrs
PvtLtd if work_exp >8yrs

Based on the threshold criteria above, how can I impute the missing categorical values in the company_type column?

【Question discussion】:

You might want to look at datascience.stackexchange.com/questions/17769/…

Tags: python pandas

【Solution 1】:

You can use numpy.select together with numpy.where:

import numpy as np

# define conditions and values
conditions = [df['Work_exp'] < 2, df['Work_exp'].between(2, 8), df['Work_exp'] > 8]
values = ['Startup', 'PublicSector', 'PvtLtd']

# apply logic where company_type is null
df['company_type'] = np.where(df['company_type'].isnull(),
                              np.select(conditions, values),
                              df['company_type'])

print(df)

   Work_exp  company_type
0      10.0        PvtLtd
1       0.5       startup
2       6.0  PublicSector
3       8.0  PublicSector
4       1.0       startup
5       9.0        PvtLtd
6       4.0  PublicSector
7       3.0  PublicSector
8       2.0       startup
9       0.0       Startup
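
For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and uses Series.fillna instead of numpy.where, so only the missing rows are touched. The label spellings and the "unknown" default are assumptions made for illustration; they are not part of the original answer. An explicit string default is passed because np.select otherwise falls back to the integer 0, which some NumPy versions may refuse to combine with string choices.

```python
import numpy as np
import pandas as pd

# Sample data from the question; np.nan marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

exp = df["Work_exp"]
conditions = [exp < 2, exp.between(2, 8), exp > 8]
# Assumption: reuse the spellings already present in the column.
values = ["startup", "Public Sector", "PvtLtd"]

# Explicit string default (hypothetical label) keeps choices and default the same type.
candidates = pd.Series(
    np.select(conditions, values, default="unknown"),
    index=df.index,
)

# fillna only replaces rows where company_type is missing.
df["company_type"] = df["company_type"].fillna(candidates)
print(df)
```

Rows that already had a company_type keep their original value; rows 3, 6 and 9 receive the label chosen by the thresholds.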

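pd.Series.between includes both the start and end values by default, and it also works for comparisons between float values. Use the inclusive=False argument to exclude the boundaries (pandas 1.3 and later spell this inclusive='neither').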

s = pd.Series([2, 2.5, 4, 4.5, 5])

s.between(2, 4.5)

0     True
1     True
2     True
3     True
4    False
dtype: bool
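
To make the boundary behaviour concrete for the thresholds in the question, the sketch below checks the values 2, 4 and 8 against between(2, 8). It assumes pandas 1.3 or later, where the inclusive parameter accepts the strings "both", "neither", "left" and "right".

```python
import pandas as pd

s = pd.Series([2, 4, 8])

# Default: both endpoints count as inside the interval.
both = s.between(2, 8)

# Assumption: pandas 1.3 or later, where `inclusive` takes a string.
neither = s.between(2, 8, inclusive="neither")
left = s.between(2, 8, inclusive="left")

print(both.tolist())     # [True, True, True]
print(neither.tolist())  # [False, True, False]
print(left.tolist())     # [True, True, False]
```

So with the default settings, between(2, 8) treats both 2 and 8 as matches.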

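【Discussion】:

@stone rock. This data is from the Analytical Vidya competition, right? The work experience variable is actually a string.
@jpp Could you add something to the answer about the inclusivity of between()? Are 2 and 8 included if I use between(2,8)?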

【Solution 2】:

Nice answer by @jpp. I just wanted to add a different approach here using pandas.cut().

df['company_type'] = pd.cut(
    df.Work_exp,
    bins=[0,2,8,100],
    right=False,
    labels=['Startup', 'Public', 'Private']
)



   Work_exp company_type
0   10.0    Private
1   0.5     Startup
2   6.0     Public
3   8.0     Private
4   1.0     Startup
5   9.0     Private
6   4.0     Public
7   3.0     Public
8   2.0     Public
9   0.0     Startup

Also, according to your conditions, shouldn't index 8 be Public?

Startup < 2
PublicSector >=2 and < 8
PvtLtd >= 8

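【Discussion】:

Thanks for your approach.
But are the missing values filled in?
If, as here, your criteria will always be based on a single series, this is a better answer. But be careful: I think the OP only wants to overwrite the NaN values, so a bit more work is needed (probably numpy.where + pd.cut).
Yes, it completely overwrites the original company_type data. You can filter for just the NaN values and replace those.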
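
One possible way to follow up on the comments above is to keep the pd.cut binning but use it only to fill the missing rows. The sketch below is an illustration under a few assumptions, not the answerer's code: it uses np.inf as the upper bin edge instead of 100, reuses the label spellings already present in the data, and fills the gaps with Series.fillna rather than numpy.where.

```python
import numpy as np
import pandas as pd

# Sample data from the question; np.nan marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

# right=False makes the bins [0, 2), [2, 8), [8, inf).
binned = pd.cut(
    df["Work_exp"],
    bins=[0, 2, 8, np.inf],
    right=False,
    labels=["startup", "Public Sector", "PvtLtd"],  # assumed spellings
)

# Cast the categorical result to plain objects, then fill only the NaN rows.
df["company_type"] = df["company_type"].fillna(binned.astype(object))
print(df)
```

Because the bins are closed on the left, a work experience of exactly 8 lands in the last bin here, whereas between(2, 8) in Solution 1 assigns it to the middle group.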