Lambda to exclude value containing a list of string [duplicate]: answers

【Question title】: Lambda to exclude value containing a list of string [duplicate]
【Posted】: 2020-01-13 11:55:00
【Question description】:

I want to exclude rows from my pandas DataFrame when they contain certain values.

excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions', '/modifier', '/diffuser', '/creation-paiement']

I have a working solution that handles them one at a time:

df = df[df['pagepath'].map(lambda x: False if '/editer' in x else True)]
df = df[df['pagepath'].map(lambda x: False if '/administration' in x else True)]
...

Alternatively, I could use the list I wrote. But when I tried the following, my IDE told me I cannot access the variable x.

df = df[df['pagepath'].map(lambda x: False for i in excluded_url_subpath if x in i)]

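Where is the error here?

【Question discussion】:

To the reviewers: yes, someone may already have posted something similar, but the answers there use expensive regular expressions. I prefer @fabio-lipreri's solution.

Tags: python pandas lambda

【Solution 1】:

You can use a regular expression. I built a sample DataFrame: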

import pandas as pd
data = {'pagepath': ['/editer', 'to_keep', 'to_delete/editer/to_delete', 'hello/voir-les-transactions', 'to_keep'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
print(df)

The code above builds the following dataset:

                               pagepath  year  reports
Cochice                         /editer  2012        4
Pima                            to_keep  2012       24
Santa Cruz   to_delete/editer/to_delete  2013       31
Maricopa    hello/voir-les-transactions  2014        2
Yuma                            to_keep  2014        3

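Now, I adapted the solution from this answer to your case. First, to keep the solution general, I escaped any non-alphanumeric characters that the strings in the excluded_url_subpath list might contain.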

import re
excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions', '/modifier', '/diffuser', '/creation-paiement']
safe_excluded_url_subpath = [re.escape(m) for m in excluded_url_subpath]
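
To illustrate why the escaping step matters, here is a small sketch. The entry '/v1.0' is a hypothetical value that does not appear in the question's list; it is chosen only because '.' is a regex metacharacter.

```python
import re

# Hypothetical entry (not from the question): '.' matches any character in a regex.
raw = '/v1.0'
escaped = re.escape(raw)

# Unescaped, the pattern also matches '/v1x0', which does not contain the literal text.
unescaped_hit = re.search(raw, 'api/v1x0/items') is not None
# Escaped, only the literal text '/v1.0' matches.
escaped_false_hit = re.search(escaped, 'api/v1x0/items') is not None
escaped_true_hit = re.search(escaped, 'api/v1.0/items') is not None

print(unescaped_hit, escaped_false_hit, escaped_true_hit)
```

The strings in the question's list contain only '/', letters and '-', so escaping probably changes nothing in how they match here; it is a safeguard for lists that may later contain characters such as '.', '?' or '+'.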

Now, using the str.contains function, I build a single regular expression by joining your list with |:

df[~df.pagepath.str.contains('|'.join(safe_excluded_url_subpath))]

I get the following DataFrame:

     pagepath  year  reports
Pima  to_keep  2012       24
Yuma  to_keep  2014        3
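
As for the error in the question: without parentheses, `lambda x: False for i in excluded_url_subpath if x in i` is parsed as a generator expression that yields lambdas, so the `x` in the `if` clause sits outside the lambda and is undefined. The membership test is also reversed (`x in i` instead of `i in x`). A minimal regex-free sketch using `any`, assuming every value in pagepath is a string:

```python
import pandas as pd

excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions',
                        '/modifier', '/diffuser', '/creation-paiement']
df = pd.DataFrame({'pagepath': ['/editer', 'to_keep', 'to_delete/editer/to_delete',
                                'hello/voir-les-transactions', 'to_keep']},
                  index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# True when none of the excluded substrings occurs in the path.
keep = df['pagepath'].map(lambda x: not any(s in x for s in excluded_url_subpath))
filtered = df[keep]
print(filtered)
```

On the sample data this should keep the same two rows (Pima and Yuma) as the regex version.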

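【Discussion】:

Nice, I had forgotten about isin and ~.
I just tried it and it doesn't work. The items in the list are substrings of what I'm looking for, @fabio-lipreri.
I didn't know you were searching for substrings. I edited the answer; let me know if it works.
Great! Works like a charm.

【Solution 2】:

You can do this by filtering the DataFrame: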

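for excluded in excluded_url_subpath:
    # Keep rows whose pagepath is not exactly equal to the excluded value
    # (this is an exact comparison, not a substring test).
    df = df[df['pagepath'] != excluded]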

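【Discussion】:

I see what you are doing, but this algorithm goes over the whole DataFrame on every loop iteration. It works the same way as trying each value with a hard-coded string, just more automated.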
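
One possible way to reduce the repeated filtering is to accumulate a single boolean mask and index the DataFrame only once. This is a sketch, assuming literal substring matching is wanted (`regex=False`) and that pagepath contains only strings. Each `str.contains` call still scans the column, so this is not a single pass over the data.

```python
import pandas as pd

excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions',
                        '/modifier', '/diffuser', '/creation-paiement']
df = pd.DataFrame({'pagepath': ['/editer', 'to_keep', 'to_delete/editer/to_delete',
                                'hello/voir-les-transactions', 'to_keep']},
                  index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# Start with "drop nothing", then OR in one literal-substring test per entry.
to_drop = pd.Series(False, index=df.index)
for excluded in excluded_url_subpath:
    to_drop |= df['pagepath'].str.contains(excluded, regex=False)

# Index the DataFrame a single time with the combined mask.
filtered = df[~to_drop]
print(filtered)
```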