Filter dataframe with two regexes: answers

【Question title】: Filter dataframe with two regexes
【Posted】: 2020-03-21 20:45:34
【Question description】:

I have this dataset:

import re
import numpy as np
import pandas as pd

frame = pd.DataFrame({'col_a' : [np.nan, 'in millions', 'millions', 'in thousands', 'thousands', np.nan, 'thousands', 'abcdef'],
                      'col_b' : ['2009', '2009', '2009', '2009', '2009', '2009', 'abc', '2009'],
                      'col_c' : ['2010', '2010', '2010', '2010', '2010', '2010', 'def', '2010'],
                      'col_d' : [np.nan, np.nan, np.nan, np.nan, np.nan, 'thousands', np.nan, np.nan]})

which produces:

          col_a col_b col_c      col_d
0           NaN  2009  2010        NaN
1   in millions  2009  2010        NaN
2      millions  2009  2010        NaN
3  in thousands  2009  2010        NaN
4     thousands  2009  2010        NaN
5           NaN  2009  2010  thousands
6     thousands   abc   def        NaN
7        abcdef  2009  2010        NaN

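I want to filter this dataframe down to the rows that

have a four-digit number (possibly with whitespace before or after the four digits); I use the regex \s*?\d{4}\s*? for this.
have 'millions' or 'thousands' (millions|thousands) or NaN in any column, but no string other than 'millions' or 'thousands'.

That is, I want rows 0, 1, 2, 3, 4, and 5.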
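
A side note on the first pattern, as a minimal sketch: it assumes the standard `re` module's `search` behaves like pandas' `str.contains`, which searches rather than full-matches. Because the lazy `\s*?` parts can always match zero characters, the pattern appears to accept exactly the same strings as plain `\d{4}`, including longer digit runs. The `strict` pattern and the sample strings are illustrative assumptions, not part of the question.

```python
import re

lazy = re.compile(r'\s*?\d{4}\s*?')    # pattern from the question
plain = re.compile(r'\d{4}')           # same pattern without the lazy whitespace
strict = re.compile(r'\s*\d{4}\s*')    # hypothetical stricter variant, used with fullmatch

# Illustrative sample strings (not taken from the question's data)
samples = ['2009', ' 2009 ', 'abc', 'FY20091', '12345']

lazy_hits = [bool(lazy.search(s)) for s in samples]
plain_hits = [bool(plain.search(s)) for s in samples]
strict_hits = [bool(strict.fullmatch(s)) for s in samples]

print(lazy_hits)    # same truth values as plain_hits
print(plain_hits)
print(strict_hits)  # only cells that are exactly four digits plus optional whitespace
```

If a cell should count only when it is exactly a four-digit number, a full match (for example `Series.str.fullmatch` in recent pandas versions) may be closer to the intent than `str.contains`.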

This is what I do:

mask = frame.astype(str).apply(lambda x: x.str.contains(r'\s*?\d{4}\s*?',
                                                        regex = True,
                                                        flags = re.IGNORECASE,
                                                        na = False)).any(axis = 1)
test = frame[mask]
mask = test.astype(str).apply(lambda x: x.str.contains(r'(in)?millions|thousands',
                                                       regex = True,
                                                       flags = re.IGNORECASE,
                                                       na = False)).any(axis = 1)
test = test[mask]
test

This gives:

          col_a col_b col_c      col_d
1   in millions  2009  2010        NaN
2      millions  2009  2010        NaN
3  in thousands  2009  2010        NaN
4     thousands  2009  2010        NaN
5           NaN  2009  2010  thousands

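Row 0 is missing from the filtered dataframe because it has NaN in col_a and col_d. In addition, Python raises a warning:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.

  # Remove the CWD from sys.path while we load stuff.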

How can I change this to also include row 0? If I change the second regex to (in)?millions|thousands|NaN, I also get rows 6 and 7, which is not what I want.
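
A likely reason the `|NaN` alternative lets an unwanted row through, as a sketch: it assumes that converting the frame to strings renders each missing value as the text `'nan'` (which is what `str(np.nan)` produces), and that `re.IGNORECASE` then lets `NaN` match it. Since the mask uses `.any(axis=1)`, one `'nan'` cell is enough for a row to pass, whatever the other cells contain. Only rows 0 and 7 from the question are reproduced here, as plain lists.

```python
import re
import numpy as np

# Second regex from the question with the extra NaN alternative
pattern = re.compile(r'(in)?millions|thousands|NaN', re.IGNORECASE)

# Rows 0 and 7 of the question's frame, written out as plain lists
rows = {
    0: [np.nan, '2009', '2010', np.nan],
    7: ['abcdef', '2009', '2010', np.nan],
}

# For each row, collect the stringified cells that the pattern matches
matching_cells = {
    idx: [str(cell) for cell in cells if pattern.search(str(cell))]
    for idx, cells in rows.items()
}

print(matching_cells)  # row 7 is matched only through its 'nan' cell
```

The `'abcdef'` cell in row 7 is never checked against anything that would reject it, so the row is kept.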

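Edit: In this dataset, I know that col_a and col_d contain NaN in row 0. In the real dataset, I do not know which columns the NaN values appear in. More generally, the filter condition is:

one column of a row must contain a four-digit number (the first regex), and
any other column may contain a string, but that string can only be 'millions' or 'thousands'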
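
One way to express the general condition above is to classify every cell and then test each row, sketched below. This is not the asker's or any answerer's method; the helper `cell_kind`, the names `year_re` and `unit_re`, and the choice to accept an optional leading "in" (so that 'in millions' counts, as the desired output implies) are all assumptions made for illustration.

```python
import re
import numpy as np
import pandas as pd

frame = pd.DataFrame({'col_a': [np.nan, 'in millions', 'millions', 'in thousands', 'thousands', np.nan, 'thousands', 'abcdef'],
                      'col_b': ['2009', '2009', '2009', '2009', '2009', '2009', 'abc', '2009'],
                      'col_c': ['2010', '2010', '2010', '2010', '2010', '2010', 'def', '2010'],
                      'col_d': [np.nan, np.nan, np.nan, np.nan, np.nan, 'thousands', np.nan, np.nan]})

# Hypothetical patterns: a whole-cell year and a whole-cell unit label
year_re = re.compile(r'\s*\d{4}\s*')
unit_re = re.compile(r'(in\s+)?(millions|thousands)', re.IGNORECASE)

def cell_kind(value):
    """Classify one cell as 'nan', 'year', 'unit', or 'other' (hypothetical helper)."""
    if pd.isna(value):
        return 'nan'
    text = str(value)
    if year_re.fullmatch(text):
        return 'year'
    if unit_re.fullmatch(text.strip()):
        return 'unit'
    return 'other'

# Classify every cell, column by column
kinds = frame.apply(lambda col: col.map(cell_kind))

has_year = (kinds == 'year').any(axis=1)   # at least one four-digit cell
no_other = (kinds != 'other').all(axis=1)  # no string besides years and units

result = frame[has_year & no_other]
print(result)
```

Because missing values are classified separately, it does not matter which columns the NaN values appear in.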

【Comments】:

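You're almost there. First build the first mask as mask1 and the second as mask2, then I think you want: df = frame[mask1 & (mask2 | frame['col_a'].isna())] followed by print(df). PS: a well-written, well-structured question!
Thanks Erfan. This works for this particular dataset. But in the real dataset I am working with, I do not know which columns the NaN values appear in. Do you have an idea for that as well? That is: one column of a row must contain a four-digit number (the first regex), and any other column may contain a string, but that string can only be 'millions' or 'thousands'. I have updated the original question accordingly.
But by your logic row 7 contains four digits and NaN, so it should be included. So your logic is: "the column with millions or thousands should not contain any other string", right?
Row 7 contains four digits but also the string 'abcdef', so it should not be included. And the logic should be: a row may contain a column with a string, but that string can only be millions or thousands.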
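
The suggestion in the first comment can be sketched as follows, with both masks computed on the original frame so that they share its index. This is a reconstruction under assumptions: the second pattern is simplified to `millions|thousands`, which matches the same cells as the question's version in a `contains` search, and the `col_a` check is specific to this sample data, as the follow-up comment points out.

```python
import re
import numpy as np
import pandas as pd

frame = pd.DataFrame({'col_a': [np.nan, 'in millions', 'millions', 'in thousands', 'thousands', np.nan, 'thousands', 'abcdef'],
                      'col_b': ['2009', '2009', '2009', '2009', '2009', '2009', 'abc', '2009'],
                      'col_c': ['2010', '2010', '2010', '2010', '2010', '2010', 'def', '2010'],
                      'col_d': [np.nan, np.nan, np.nan, np.nan, np.nan, 'thousands', np.nan, np.nan]})

text = frame.astype(str)

# mask1: some column of the row contains four consecutive digits
mask1 = text.apply(lambda col: col.str.contains(r'\d{4}', regex=True, na=False)).any(axis=1)

# mask2: some column of the row contains 'millions' or 'thousands'
mask2 = text.apply(lambda col: col.str.contains(r'millions|thousands', regex=True,
                                                flags=re.IGNORECASE, na=False)).any(axis=1)

# Combine on the original frame; NaN in col_a is accepted as an alternative to mask2
df = frame[mask1 & (mask2 | frame['col_a'].isna())]
print(df)
```

Since every mask is built from `frame` and applied to `frame`, no reindexing of the boolean key is needed.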

Tags: python regex pandas

【Solution 1】:

In case anyone runs into a similar problem: I found that a negative lookahead works for me:

mask = frame.apply(lambda x: x.str.contains(r'^(?!.*?thousand|.*?million|\d{4}).*?$',
                                            regex = True,
                                            flags = re.IGNORECASE)).any(axis = 1)
test = frame[~mask]
test

which produces:

          col_a col_b col_c      col_d
0           NaN  2009  2010        NaN
1   in millions  2009  2010        NaN
2      millions  2009  2010        NaN
3  in thousands  2009  2010        NaN
4     thousands  2009  2010        NaN
5           NaN  2009  2010  thousands
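
To see what the negative lookahead does at the level of a single cell, here is a small sketch using the standard `re` module. The pattern is the one from the answer; the sample strings are illustrative assumptions. The pattern matches, and so flags, any string that neither contains "thousand" or "million" nor begins with four digits; rows with at least one flagged cell are then dropped by `~mask`.

```python
import re

# Pattern from the answer: matches strings that should disqualify a row
pattern = re.compile(r'^(?!.*?thousand|.*?million|\d{4}).*?$', re.IGNORECASE)

# Illustrative cell values; ' 2009' (leading space) is not in the question's data
samples = ['in millions', 'thousands', '2009', 'abc', 'abcdef', ' 2009']

flagged = [s for s in samples if pattern.search(s)]
print(flagged)
```

Note that the `\d{4}` alternative is only tested at the start of the string, so a year with leading whitespace such as `' 2009'` is flagged too. If such cells can occur, stripping whitespace first, or writing the alternative as `\s*\d{4}`, may be worth considering.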

【Comments】: