Python Pandas - something like ISIN but "contains" vs "exact" match - Answers

【Question title】: Python Pandas - something like ISIN but "contains" vs "exact" match
【Posted】: 2016-06-02 16:54:53
【Question description】:

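I'm working with two dataframes in Python Pandas. The first dataframe contains records from a customer database (first name, last name, email, etc.). The second dataframe contains a list of domain names, e.g. gmail.com, hotmail.com, etc.

I'm trying to exclude records from the customer dataframe when the email address contains a domain name from the second list. In other words, when the domain of a customer's email address appears on the domain blacklist, I need to remove that customer.

Here are the sample dataframes: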

>>> customer = pd.DataFrame({'Email': [
    "bob@example.com", 
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

>>> blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

>>> customer
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
2    joe@gmail.com        Joe
>>> blacklist
        Domain
0    gmail.com
1  outlook.com

My desired output is:

>>> filtered_list = magic_happens_here(customer, blacklist)
>>> filtered_list
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

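So far I've tried:

To eliminate specific email addresses, in the past I've used df1[df1['email'].isin(~df2['email']) ... but that obviously doesn't help with the use case I'm describing here.
I tried using df.apply, but I couldn't get the syntax right, and I imagine the performance on the real dataset would be terrible. Example: df1['Email'].apply(lambda x: x for i in ['gmail.com', 'outlook.com'] if i in x). Although this looks like it should work, I get TypeError: 'generator' object is not callable.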

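The remaining questions are:

What's the best approach here?
Why isn't the generator callable?
...and ultimately, how do I exclude customers from the dataframe when their email address domain is in the exclusion set?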
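
On the second question above: a minimal sketch of why the apply call receives a generator rather than a function. Without parentheses around the lambda, Python parses the whole argument as a generator expression whose element is `lambda x: x`. The helper names `describe` and `has_blacklisted_domain` are hypothetical, introduced only for illustration.

```python
domains = ['gmail.com', 'outlook.com']


def describe(arg):
    """Return the type name of whatever object was passed in."""
    return type(arg).__name__


# Same shape as the call in the question: the argument is parsed as
# (lambda x: x) for i in domains if i in x, i.e. a generator expression.
# It is never iterated here, so the unbound name x is never looked up.
passed = describe(lambda x: x for i in domains if i in x)
print(passed)

# A callable alternative: a plain function that tests one email string.
def has_blacklisted_domain(email):
    return any(d in email for d in domains)

print(has_blacklisted_domain('joe@gmail.com'))
```

A function like `has_blacklisted_domain` is something `apply` can call once per value, whereas a generator object cannot be called at all, which is what the TypeError reports.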

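【Question discussion】:

Add sample dataframes.
@VedangMehta Good point, I've added sample dataframes.
I've added some comparisons and timings - you might find them interesting...

Tags: python pandas dataframe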

【Solution 1】:

Try this:

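# invalid_emails is the tuple of blacklisted domains (as defined in Solution 2)
invalid_emails = tuple(blacklist['Domain'])
customer[~customer.Email.str.endswith(invalid_emails)]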

or

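# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
customer[~customer.Email.str.replace(r'^[^@]*\@', '', regex=True).isin(blacklist.Domain)]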

In [399]: filtered_list
Out[399]:
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

Explanation:

In [395]: customer.Email.str.replace(r'^[^@]*\@', '')
Out[395]:
0    example.com
1    example.com
2      gmail.com
Name: Email, dtype: object

In [396]: customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)
Out[396]:
0    False
1    False
2     True
Name: Email, dtype: bool
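
The transcript above relies on `str.replace` treating its pattern as a regular expression. Since pandas 2.0 the default is `regex=False`, so a version-independent sketch of the same two steps (extract the domain, then test it with `isin`) would pass `regex=True` explicitly. The sample data is the same as in the question.

```python
import pandas as pd

customer = pd.DataFrame({
    'Email': ["bob@example.com", "jim@example.com", "joe@gmail.com"],
    'First Name': ["Bob", "Jim", "Joe"],
})
blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

# Strip everything up to and including the "@" so only the domain remains.
# regex=True makes the pattern a regular expression regardless of the default.
domains = customer['Email'].str.replace(r'^[^@]*@', '', regex=True)

# Exact-match the extracted domain against the blacklist and keep the rest.
filtered_list = customer[~domains.isin(blacklist['Domain'])]
print(filtered_list)
```

Because `isin` compares whole domains, this variant only removes addresses whose domain matches a blacklist entry exactly.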

Timing: against a 300K-row DF:

In [401]: customer = pd.concat([customer] * 10**5)

In [402]: customer.shape
Out[402]: (300000, 2)

In [420]: %timeit customer[~customer.Email.str.endswith(invalid_emails)]
10 loops, best of 3: 136 ms per loop

In [421]: %timeit customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
10 loops, best of 3: 151 ms per loop

In [422]: %timeit customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
1 loop, best of 3: 642 ms per loop

Conclusion:

customer[~customer.Email.str.endswith(invalid_emails)] is slightly faster than customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))], while customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)] is much slower.
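
One caveat when choosing between the approaches compared above: a bare domain suffix also matches any longer domain that merely ends with it. The sketch below illustrates this with a made-up address, "amy@notgmail.com", which is not part of the original data, and shows one possible workaround (prefixing each domain with "@"). The names `loose` and `strict` are hypothetical.

```python
import pandas as pd

customer = pd.DataFrame({
    'Email': ["bob@example.com", "amy@notgmail.com", "joe@gmail.com"],
    'First Name': ["Bob", "Amy", "Joe"],
})
blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

# Bare suffixes: "amy@notgmail.com" ends with "gmail.com", so it is dropped too.
bare = tuple(blacklist['Domain'])
loose = customer[customer['Email'].apply(lambda s: not s.endswith(bare))]

# Anchored suffixes: "@gmail.com" only matches that exact domain.
anchored = tuple('@' + d for d in blacklist['Domain'])
strict = customer[customer['Email'].apply(lambda s: not s.endswith(anchored))]

print(loose)
print(strict)
```

The replace-and-isin variant does not have this problem, since it compares the full domain, so the speed difference is traded against matching precision unless the suffixes are anchored.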

【Discussion】:

【Solution 2】:

Code -

import pandas as pd


customer = pd.DataFrame({'Email': [
    "bob@example.com",
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

invalid_emails = tuple(blacklist['Domain'])

df = customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]

print(df)

Output -

             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

【Discussion】:

Wow, I didn't know endswith accepted a list. Works great.
It accepts a tuple or a string, but not a list.
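
A quick sketch of the point made in the last comment, using only the built-in `str.endswith`: a tuple or a single string works, while a list raises a TypeError. The variable names are illustrative.

```python
domains = ["gmail.com", "outlook.com"]
email = "joe@gmail.com"

# A tuple of suffixes is accepted: True if the string ends with any of them.
matches_tuple = email.endswith(tuple(domains))

# A single string suffix is accepted as well.
matches_str = email.endswith("gmail.com")

# A list is rejected with a TypeError, hence the tuple(...) conversion.
try:
    email.endswith(domains)
    list_error = None
except TypeError as exc:
    list_error = exc

print(matches_tuple, matches_str, type(list_error).__name__)
```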