Fill missing values in selected columns with filtered values in other column: answers

【Question title】: Fill missing values in selected columns with filtered values in other column
【Posted】: 2020-12-25 10:47:14
【Question description】:

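I have a strange column named null in a dataframe that contains some of the values missing from other columns. One column, named location, holds latitude/longitude coordinates; the other, named level, holds integers representing the target variable. In some but not all cases where location or level is missing a value, the value that should be there is in this null column. Here is an example df: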

import numpy as np
import pandas as pd

df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)

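I need to be able to filter the null column based on whether the value is an integer or a string, and then fill the missing values in the appropriate column with the appropriate values. I have tried .apply() with lambda functions in a for loop, as well as .match(), .contains(), and in, but no luck so far.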

【Comments】:

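What is your expected output?
I need to be able to filter the null column based on whether the value is an integer or a string, and then fill the missing values in the appropriate column on that basis (strings into 'location' and integers into 'level').
Check my answer and let me know whether it works~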

Tags: python regex pandas null fillna

【Solution 1】:

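The simplest way, if not the easiest, is to fill all missing values in df.location and df.level with the values from df.null, then use regular expressions to build a boolean filter that resets the inappropriate/misassigned values in df.location and df.level to np.nan.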

Series.fillna()

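import numpy as np
import pandas as pd

df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)

# Fill every missing value in both columns from `null`;
# misassigned values are reset to NaN in the next step
for col in ['location', 'level']:
     df[col] = df[col].fillna(df['null'])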

Now we will use string expressions to correct the misassigned values.

str.contains()

# Converting columns to type str so string methods work
df = df.astype(str)

# Using regex to change values that don't belong in column to NaN
regex = '[,]'
df.loc[df.level.str.contains(regex), 'level'] = np.nan

regex = r'^\d\.?0?$'
df.loc[df.location.str.contains(regex), 'location'] = np.nan

# Returning `df.level` to float datatype (str is the correct
# datatype for `df.location`); astype returns a new Series,
# so the result has to be assigned back
df['level'] = df.level.astype(float)

Here is the output:

pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: '43.70477575,-72.28844073', 1: np.nan, 2: '43.70637091,-72.28704334', 3: np.nan, 4: np.nan},
      'level': {0: np.nan, 1: 2.0, 2: np.nan, 3: 4.0, 4: 3.0}
     }
)
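
The fill-then-reset idea above can also be sketched in a single pass, by building the two boolean filters first and filling each column only from the rows that match. This is a minimal sketch, not the answer's own code: it assumes that a coordinate pair always contains a comma and that a level is a bare run of digits, and the names is_location and is_level are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# Assumption: coordinates contain a comma, levels consist of digits only.
is_location = df["null"].str.contains(",", regex=False)
is_level = df["null"].str.fullmatch(r"\d+")

# Series.where keeps values where the condition is True and yields NaN elsewhere,
# so each column is only ever filled with values of the right kind.
df["location"] = df["location"].fillna(df["null"].where(is_location))
df["level"] = df["level"].fillna(df["null"].where(is_level)).astype(float)

print(df)
```

Because nothing is filled incorrectly in the first place, no second clean-up pass and no conversion of the whole frame to str is needed.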

【Discussion】:

【Solution 2】:

Let's try to_numeric

checker = pd.to_numeric(df.null, errors='coerce')
checker
Out[171]: 
0    NaN
1    2.0
2    NaN
3    4.0
4    3.0
Name: null, dtype: float64

Then apply isnull; a NaN result means the value is a string rather than an int

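isstring = checker.isnull()
isstring
Out[172]: 
0     True
1    False
2     True
3    False
4    False
Name: null, dtype: bool
isnumber = checker.notnull()

Fill the values

# Cast to object first so the all-NaN float column can hold strings
df['location'] = df['location'].astype(object)
# Strings go to `location`, numbers go to `level`
df.loc[isstring, 'location'] = df['null']
df.loc[isnumber, 'level'] = checker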
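
Since the question says values are missing in some but not all rows, here is a minimal sketch of the to_numeric approach that only fills the gaps and leaves existing values alone. The helper name fill_from_null and the partially filled sample frame are assumptions for illustration (the sample reuses the coordinates from the question); it also assumes every non-numeric entry in null is a location string.

```python
import numpy as np
import pandas as pd

def fill_from_null(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill gaps in `location`/`level` from `null`."""
    out = frame.copy()
    # Anything that cannot be parsed as a number becomes NaN.
    numeric = pd.to_numeric(out["null"], errors="coerce")
    # Keep only the entries that failed to parse (assumed to be locations).
    strings = out["null"].where(numeric.isna())
    # fillna touches missing cells only, so existing values survive.
    out["level"] = out["level"].fillna(numeric)
    out["location"] = out["location"].fillna(strings)
    return out

# Illustrative frame: some cells are already present, `null` has its own gap.
df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", np.nan],
    "location": [np.nan, "43.70637091,-72.28704334", "43.70477575,-72.28844073"],
    "level": [np.nan, np.nan, 4.0],
})

result = fill_from_null(df)
print(result)
```

The helper works on a copy, so the original frame is left unchanged and the result can be compared against it.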

【Discussion】:

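This code can filter the null column for integers, but it does not fill the missing values in the level column with those values.
@KristianCanler You can fill using the conditions above; check the update.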

【Solution 3】:

Another approach could use the method pandas.Series.mask:

>>> df
                       null  location  level
0  43.70477575,-72.28844073       NaN    NaN
1                         2       NaN    NaN
2  43.70637091,-72.28704334       NaN    NaN
3                         4       NaN    NaN
4                         3       NaN    NaN
>>> df['level'] = df.level.mask(df.null.str.isnumeric(), other = df.null)
>>> df['location'] = df.location.where(df.null.str.isnumeric(), other = df.null)
>>>
>>> df
                       null                  location level
0  43.70477575,-72.28844073  43.70477575,-72.28844073   NaN
1                         2                       NaN     2
2  43.70637091,-72.28704334  43.70637091,-72.28704334   NaN
3                         4                       NaN     4
4                         3                       NaN     3

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html

【Discussion】:

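Sorry, I needed to update the dataframe! That was an edited version. I have now put the correct one in the OP.
When I implement this code, it seems to fill some strings into df.level, which needs to be all integers.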
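
One possible way to address the integer requirement raised in the last comment is to convert level to pandas' nullable integer dtype after selecting the digit-only entries. This is a sketch under assumptions, not the answerer's code: it assumes null contains only strings (no missing entries) and that levels are non-negative whole numbers, and the name is_level is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# True for entries made up of digits only, i.e. the levels.
is_level = df["null"].str.isnumeric()

# mask replaces values where the condition is True; assigning the returned
# Series avoids relying on inplace=True for a column selected from the frame.
df["location"] = df["location"].mask(~is_level, df["null"])

# Keep only the digit strings, parse them, then use the nullable "Int64"
# dtype so the column holds integers alongside missing values.
df["level"] = pd.to_numeric(df["null"].where(is_level)).astype("Int64")

print(df.dtypes)
print(df)
```

A plain int64 column cannot hold missing values, which is why the nullable Int64 dtype (capital I) is used here; rows without a level show up as <NA>.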