Using Pandas filtering non-numeric data from two columns of a Dataframe

【Question title】: Using Pandas filtering non-numeric data from two columns of a Dataframe
【Posted】: 2016-07-26 07:44:47
【Question description】:

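I am loading a Pandas dataframe that has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered a random comment like "not measured". I need to drop every row in which either of those two columns holds a value that is not a number, while preserving non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...):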

import pandas as pd

df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))

This results in the following data table:

    A   B   C               D
0   1   96  12              apples
1   2   33  Not measured    oranges
2   3   45  15              peaches
3   4       66              plums
4   5   8   42              pears

I'm not clear how to get to this table:

    A   B   C               D
0   1   96  12              apples
2   3   45  15              peaches
4   5   8   42              pears

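I tried dropna, but the column types are "object" because of the non-numeric entries. I can't convert the values to floats without either converting the whole table or doing one series at a time, which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?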
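
The dropna behaviour described above can be reproduced with a short sketch. It assumes the sample frame from the question and a reasonably recent pandas; the variable name rows_after_dropna is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})

# Columns that mix integers and strings are stored with the generic object dtype.
print(df.dtypes)

# dropna only removes missing values (NaN/None). '' and 'Not measured' are
# ordinary strings, so no rows are dropped.
rows_after_dropna = len(df.dropna(subset=["B", "C"]))
print(rows_after_dropna)
```

This is why the unwanted values first have to be turned into NaN before any null-based filtering can work.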

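【Question discussion】:

Tags: excel numpy pandas

【Solution 1】:

You can first select the subset of columns B and C, apply to_numeric to it, and check that all values in each row are notnull. Then use boolean indexing: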

print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0     True
1    False
2     True
3    False
4     True
dtype: bool

print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
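
For reference, the first solution can be wrapped into a small self-contained sketch in Python 3 syntax. The helper name keep_numeric_rows is hypothetical and not part of pandas; notna is used as the modern alias of notnull.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})


def keep_numeric_rows(frame, columns):
    """Hypothetical helper: keep rows where every listed column parses as a number."""
    # errors='coerce' turns anything that cannot be parsed into NaN.
    numeric = frame[columns].apply(pd.to_numeric, errors="coerce")
    # A row survives only if none of the checked columns became NaN.
    mask = numeric.notna().all(axis=1)
    return frame[mask]


result = keep_numeric_rows(df, ["B", "C"])
print(result)
```

The mask is computed on a converted copy, so the returned rows still carry the original values in B and C, and the other columns are untouched.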

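The next solution uses str.isdigit with isnull, notnull and xor (^). It relies on the .str methods returning NaN for non-string values, so it effectively tests whether each value is a string rather than whether it is numeric: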

print(df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull())
0     True
1    False
2     True
3    False
4     True
dtype: bool

print(df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears

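But the solution with to_numeric, isnull and notnull is the fastest (note that, like any xor-based mask, it would keep a row in which both B and C are non-numeric; the sample data has no such row):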

print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         ^ pd.to_numeric(df['C'], errors='coerce').isnull()])

   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
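
The xor-based masks above give the expected result on the sample data because no row is non-numeric in both columns at once. The sketch below uses a small hypothetical frame (df2, not from the question) to compare the xor mask with a plain "and" of two notna checks, which is one way to make the per-column approach robust to that case.

```python
import pandas as pd

# Hypothetical data: row 1 is non-numeric in both B and C.
df2 = pd.DataFrame({
    "B": [96, "", 45],
    "C": [12, "Not measured", "missing"],
})

b_numeric = pd.to_numeric(df2["B"], errors="coerce")
c_numeric = pd.to_numeric(df2["C"], errors="coerce")

# Same logic as the answer: notnull() on B xor isnull() on C.
xor_mask = b_numeric.notna() ^ c_numeric.isna()

# Alternative: require both columns to be numeric.
and_mask = b_numeric.notna() & c_numeric.notna()

print(xor_mask.tolist())  # row 1 is kept although both values are bad
print(and_mask.tolist())  # only row 0 is kept
```

With the "and" mask, df2[and_mask] keeps only the row where both values parse as numbers.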

Timings:

#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)

In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop

In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop

In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 3.49 ms per loop
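
Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The function names per_column and with_apply are illustrative, the per-column variant uses & rather than the xor from the answer, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})

# Repeat the sample frame 1000 times to get 5,000 rows.
big = pd.concat([df] * 1000).reset_index(drop=True)


def per_column():
    # Convert each column separately and combine the two masks.
    return big[pd.to_numeric(big["B"], errors="coerce").notna()
               & pd.to_numeric(big["C"], errors="coerce").notna()]


def with_apply():
    # Convert both columns via apply and require all values per row.
    converted = big[["B", "C"]].apply(pd.to_numeric, errors="coerce")
    return big[converted.notna().all(axis=1)]


for fn in (per_column, with_apply):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```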

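【Discussion】:

Thanks! For maintainability, I like the first solution, the one using apply and notnull. It seems to work! I'll take a day to see whether any problems pop up, or whether anyone responds with a simpler solution.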