Filter rows failing dtype conversion: answers

【Question title】: Filter rows failing dtype conversion
【Posted】: 2021-11-26 04:31:51
【Question description】:

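I have a pandas DataFrame with many columns. Every column has dtype object because some of the columns contain string values. Is there a way to move the rows in which any column holds a string into a separate DataFrame, and then convert the cleaned DataFrame to an integer dtype?

I've figured out the second part but can't get the first part working: filtering out rows whose values contain string characters such as 'a', 'b', etc. If df is: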

df = pd.DataFrame({
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
    })

it should be split into 2 DataFrames:

df = pd.DataFrame({
    'col1':[1,2,3],
    'col2':[1,2,5],
    'col3':[1,2,4]
    })

dfError = pd.DataFrame({
    'col1':['a',0],
    'col2':[3,4],
    'col3':[3,'45a5']
    })

【Question comments】:

Tags: python pandas string dtype

【Solution 1】:

I believe this is an efficient way to do it.

import pandas as pd

df = pd.DataFrame({ # main dataframe
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
    }) 

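mask = df.apply(pd.to_numeric, errors='coerce').isna() # True where a value couldn't be converted to numeric
mask = mask.any(axis=1) # True for rows with at least one value that couldn't be numeric (axis passed by keyword; the positional form any(1) is rejected by newer pandas)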

df1 = df[~mask] # could be numeric
df2 = df[mask]  # couldn't be numeric

Breakdown:

df.apply(pd.to_numeric) # converts the dataframe into numeric, but this would give us an error for the string elements (like 'a')

df.apply(pd.to_numeric, errors='coerce') # 'coerce' sets any non-valid element to NaN (converts the string elements to NaN).

>>>
   col1  col2  col3
0   1.0     1   1.0
1   2.0     2   2.0
2   NaN     3   3.0
3   0.0     4   NaN
4   3.0     5   4.0
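To see what errors='coerce' changes in isolation, here is a minimal sketch on a single Series. The Series `s` and its values are made up for illustration and are not part of the question's data. Note that a string that looks like a number ('7') is parsed rather than turned into NaN.

```python
import pandas as pd

# Hypothetical sample: an int, two unparseable strings, and a numeric-looking string.
s = pd.Series([1, 'a', '45a5', '7'], dtype=object)

# With the default errors='raise', the first unparseable value raises ValueError.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# With errors='coerce', unparseable values become NaN instead of raising.
coerced = pd.to_numeric(s, errors='coerce')
print(raised)
print(coerced)
```

Because of the NaN values, the coerced result is a float Series, which is why col1 and col3 show up as floats in the output above.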

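df.apply(pd.to_numeric, errors='coerce').isna() # Detect missing values, i.e. the elements that could not be converted.
>>>
    col1   col2   col3
0  False  False  False
1  False  False  False
2   True  False  False
3  False  False   True
4  False  False  False

mask.any(axis=1) # Returns whether any element is True in each row (mask is the isna() result above)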
>>>
0    False
1    False
2     True
3     True
4    False
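Putting the steps together, the split plus the integer conversion the question asks for could look roughly like the sketch below. The names `numeric`, `bad_rows`, `df_clean`, and `df_error` are chosen here for illustration. It assumes every valid value is a whole number (astype would silently truncate something like 1.5) and that the data has no genuine missing values, since those would also end up in the error frame.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, 5],
    'col3': [1, 2, 3, '45a5', 4],
})

# Coerce every column; values that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')

# True for rows with at least one value that failed to convert.
bad_rows = numeric.isna().any(axis=1)

# Clean rows, converted to integers (assumes whole-number values only).
df_clean = numeric[~bad_rows].astype('int64')

# Failing rows, kept with their original (unconverted) values.
df_error = df[bad_rows]

print(df_clean)
print(df_error)
```

Taking the clean rows from `numeric` rather than from `df` avoids parsing the data a second time.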

【Comments】:

Can you explain what errors='coerce' does? I'm not sure what it does either.
Edited the answer with an explanation. @fellowCoder

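【Solution 2】:

I'm not sure whether there is an efficient way to check this, but a dirty (and possibly slow) way could be:

str_cond = df.applymap(lambda x: isinstance(x, str)).any(axis=1) # on pandas >= 2.1, DataFrame.map replaces the deprecated applymap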

df[~str_cond]
  col1  col2 col3
0    1     1    1
1    2     2    2
4    3     5    4

df[str_cond]
  col1  col2  col3
2    a     3     3
3    0     4  45a5
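One thing worth knowing is that the isinstance check and the to_numeric check are not equivalent. The sketch below changes the last value of col2 to the string '5' (an assumption added for illustration, not in the original data) to show the difference. It also picks DataFrame.map when it exists, since applymap is deprecated from pandas 2.1 onward; the name `elementwise` is made up here.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, '5'],   # '5' is a numeric-looking string, added for illustration
    'col3': [1, 2, 3, '45a5', 4],
})

# Use DataFrame.map where available (pandas >= 2.1), otherwise fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Rows containing at least one str object, whatever its content.
str_rows = elementwise(lambda x: isinstance(x, str)).any(axis=1)

# Rows containing at least one value that cannot be parsed as a number.
unparseable_rows = df.apply(pd.to_numeric, errors='coerce').isna().any(axis=1)

print(str_rows.tolist())
print(unparseable_rows.tolist())
```

Row 4 is flagged by the type check but not by the parsing check, so which approach fits depends on whether strings like '5' should count as errors.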

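【Comments】:

Can you explain what the lambda and any are doing?
applymap applies the lambda to every element of the DataFrame, and the lambda checks whether the element is of type str. any(axis=1) returns True for a row if any element in that row is a string, and we can then use that result to filter the DataFrame.
Just to confirm: the "1" in any means "if any one of the columns". If I had "2" there, it would mean any 2 columns, correct? Also, what values does all take?
No. 1 refers to rows and 0 to columns: with 1 you get an any result for each row, and with 0 you get one for each column. I don't think you can pass 2 there.
Got it! Thanks!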
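For anyone else unsure about the axis argument, here is a small sketch of any and all with both axis values. The `flags` frame is made up for illustration. all accepts the same axis values as any, and 2 is not a valid axis for a DataFrame.

```python
import pandas as pd

# Hypothetical boolean frame, not from the question.
flags = pd.DataFrame({
    'col1': [False, True, True],
    'col2': [False, False, True],
})

any_per_row = flags.any(axis=1)   # one result per row: is any column True?
any_per_col = flags.any(axis=0)   # one result per column: is any row True?
all_per_row = flags.all(axis=1)   # one result per row: are all columns True?

# A DataFrame only has axes 0 and 1, so axis=2 is rejected.
try:
    flags.any(axis=2)
    axis_2_ok = True
except ValueError:
    axis_2_ok = False

print(any_per_row.tolist(), any_per_col.tolist(), all_per_row.tolist(), axis_2_ok)
```

The string aliases axis='index' and axis='columns' mean the same as 0 and 1 and can be easier to read.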