python: data cleaning - detect pattern for fraudulent email addresses (answers)

【Question title】: python: data cleaning - detect pattern for fraudulent email addresses
【Posted】: 2017-11-19 09:02:04
【Question description】:

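I am cleaning a data set that contains fraudulent email addresses, which I want to remove.

I have set up several rules to catch duplicates and fraudulent domains. But there is one scenario for which I can't work out how to write a rule in python to flag the addresses.

So I have, for example, rules like this: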

#delete punctuation (assign the result back, otherwise it is discarded)
df['email'] = df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)

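Here is the data for which I cannot figure out a rule. Basically I am looking for a way to flag addresses that start the same way but have consecutive numbers at the end.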

abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com

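【Comments on the question】:

To clarify - which of the email addresses in the example you gave do you want flagged as fraudulent?
All of these examples are fraudulent
So abc@gmail.com is fine, but things like abc1, abc12, .. would be fraudulent? And they would only be fraudulent if abc@gmail.com exists?
@jeangelj What is "yopmail", and what does it have to do with the data?
It might be worth only allowing email addresses from a whitelist, requiring verification, and blocking multiple sign-ups from the same IP. You can write regex rules all day long, but anyone with a beer and some free time can get around them.

Tags: python data-cleaning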

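【Solution 1】:

My solution is neither efficient nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!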

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # shuffle to show that the input order does not matter
emails = sorted(emails) # sort so that similar addresses end up next to each other
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

grouped=[]
start = 0
for i in split_indices:
    grouped.append(emails[start:i]) # slice from the previous split point, not from 0
    start = i
grouped.append(emails[start:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]: 
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']  

# THE TRUTH YALL!

【Comments】:

【Solution 2】:

I have an idea for how to solve this:

fuzzywuzzy

Build a set of the unique emails, loop over them, and compare them using fuzzywuzzy. Example:

import re
from fuzzywuzzy import fuzz

# emailset: the unique email addresses; data: rows that have an 'email' field
for email in emailset:
    for row in data:
        # keep only the part before the @
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            pass  # flagging operation goes here

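I have taken some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I mean. This is very rough code, because I have not thought through how to stop repeated flagging.

Anyway, the elif part compares the two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com), and if the ratio is above 80 (play around with this number), applies your flagging operation. For example: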

fuzz.partial_ratio("abc7020.1", "abc7020")

100
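
One possible way around the repeated flagging mentioned above is to visit each unordered pair of addresses exactly once with `itertools.combinations`. The sketch below is only an illustration: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy (its ratio is on a 0-1 scale and is not the same measure as `fuzz.partial_ratio`), and the names `local_part`, `flag_similar` and the 0.8 threshold are assumptions, not part of the answer.

```python
from difflib import SequenceMatcher
from itertools import combinations


def local_part(email):
    """Return everything before the last '@'."""
    return email.rsplit('@', 1)[0]


def flag_similar(emails, threshold=0.8):
    """Return the set of addresses whose local part resembles another one."""
    flagged = set()
    unique = sorted(set(emails))
    # combinations() yields each unordered pair exactly once,
    # so no pair is compared (or flagged) twice
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, local_part(a), local_part(b)).ratio()
        if ratio > threshold:
            flagged.update((a, b))
    return flagged


sample = ['abc7020@gmail.com', 'abc7020.1@gmail.com', 'zed@gmail.com']
print(sorted(flag_similar(sample)))
```

This is still a quadratic number of comparisons, so it would likely need some pre-grouping (for example by sorting, as in Solution 1) before being run on a large data set.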

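【Comments】:

Hi! Don't worry, I would not downvote someone who took the time to help me. Thank you very much for the idea - I will definitely try it; it may actually be a good solution, I just want to make sure it doesn't flag legitimate emails as spam

【Solution 3】:

Here is one way to handle it that should be quite efficient. We group the email addresses by length, so that we only need to check whether each address matches one in the next length down, by slicing and doing a set membership check.

Code:

First, read in the data: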

import pandas as pd
import numpy as np

string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''

x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]

We split off the @foo.bar part, keep only the local parts that end in a digit, and add a 'lengths' column:

#split on @, expand means into two columns
emails =  x.x.str.split('@', expand = True)
#filter rows whose local part ends in a digit (the boolean mask selects rows, not columns)
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(),:].copy()
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()

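Now all we have to do is take each length and the length - 1, and see whether a string of that length, with its last character dropped, appears in the set of strings of length n-1 (and we have to check the reverse as well, in case it is the shortest of the repeats):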

#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)

#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j -1
    #we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against,0]])
    #we cut off the last char of each in to test
    tests = emails.loc[totest,0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #viceversa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest,0].str[:-1]])
    tests = emails.loc[against,0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)

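The resulting mask can be converted to boolean and used to subset the original (de-duplicated) dataframe, and its index should match the original index for subsetting, like this: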

x.loc[~mask.astype(bool),:]
    x
0   abc7020@gmail.com
16  foo123@bar.com
17  foo1@bar.com

You can see that we have not removed your first value, because the '.' means it did not match - you could remove the punctuation first.
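
As a rough sketch of "removing the punctuation first": the snippet below strips punctuation from the part before the `@` only, so the domain stays intact. The helper name `strip_local_punctuation` is made up for illustration, and it is written in plain Python rather than pandas; it could be applied to a column with `Series.apply`.

```python
import string

# translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_local_punctuation(email):
    """Remove punctuation from the local part, leaving the domain untouched."""
    local, sep, domain = email.rpartition('@')
    if not sep:
        # no '@' at all: treat the whole string as the local part
        local, domain = domain, ''
    return local.translate(_PUNCT_TABLE) + sep + domain


print(strip_local_punctuation('abc7020.1@gmail.com'))
```

After this step `abc7020.1` becomes `abc70201`, which is exactly one character longer than `abc7020`, so the length-based check has a chance to pair the two.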

【Comments】:

【Solution 4】:

You can do this with a regular expression; example below:

import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"

pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")

if pattern.search(a):
    print("spam1")

if pattern.search(b):
    print("spam2")

if pattern.search(c):
    print("spam3")

if pattern.search(d):
    print("spam4")

If you run the code you will see:

$ python spam.py 
spam1
spam2
spam3
spam4

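The benefit of this approach is that it is standardized (regular expressions), and you can easily tune the strictness of the match by adjusting the values in {}; this means you can have a global config file in which you set/adjust the values. You can also easily tweak the regular expression without rewriting code.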
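
A minimal sketch of the "values in a config" idea, assuming the settings live in a plain dictionary: the names `CONFIG`, `min_digits`, `max_digits` and `build_pattern` are hypothetical and only show how the `{}` bounds of the same pattern could be filled in from outside the code.

```python
import re

# hypothetical settings; in practice these might be loaded from a config file
CONFIG = {"min_digits": 3, "max_digits": 500}


def build_pattern(min_digits, max_digits):
    """Build the digit-run pattern with the repetition bounds taken from arguments."""
    return re.compile(
        r"[0-9]{%d,%d}\.?[0-9]{0,%d}?@" % (min_digits, max_digits, max_digits)
    )


pattern = build_pattern(**CONFIG)
for address in ["attn12345@gmail.com", "attn12@gmail.com"]:
    print(address, bool(pattern.search(address)))
```

With `min_digits` set to 3, `attn12345@gmail.com` matches and `attn12@gmail.com` does not. Note that a pattern like this matches any address with enough trailing digits, so on its own it cannot tell a single legitimate address from a numbered series.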

【Comments】:

Thanks - I have a user anna213@gmail.com who is a legitimate user, but attn12, attn123, attn1234, attn12345 are not, and I only want to catch those
@bbb31 ..or use a captcha wherever possible.

【Solution 5】:

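import numpy as np

# email_list: the list of email address strings to check
ids = [s.split('@')[0] for s in email_list]
# plain bool works across NumPy versions; np.bool was removed in NumPy 1.24
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        # mj is mi plus exactly one extra trailing character
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])  # raises ValueError unless the extra character is a digit
                det[j,i] = True
                det[i,j] = True
            except ValueError:
                continue

# indices of addresses that take part in at least one matching pair
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()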

【Comments】:

Thanks! I will test it right away

【Solution 6】:

You can use edit distance (a.k.a. Levenshtein distance) and pick a difference threshold. In python:

$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["attn1@gmail.com...", ...]# set up data to be the set you gave
>>> # skip comparing an address with itself: that distance is always 0 and would flag everything
>>> fraudulent_emails = set([email for email in data for other in data if email != other and editdistance.eval(email, other) < threshold])

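If you want to be a bit cleverer, you can iterate over the resulting list instead of turning it into a set, and keep track of how many other email addresses are near each one - then use that as a "weight" for judging how fake it is.

This not only covers the case you gave (where the fraudulent addresses all share a common beginning and differ only in the numeric suffix), but also catches number or letter padding added at the beginning or in the middle of the email address.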
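
The "weight" idea could look roughly like the sketch below. To keep it self-contained it uses a small pure-Python Levenshtein function in place of the `editdistance` package, and the names `levenshtein` and `neighbour_counts` are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def neighbour_counts(emails, threshold=5):
    """Map each address to the number of *other* addresses within the threshold."""
    unique = sorted(set(emails))
    return {
        email: sum(
            1 for other in unique
            if other != email and levenshtein(email, other) < threshold
        )
        for email in unique
    }


sample = ['attn1@gmail.com', 'attn12@gmail.com', 'attn123@gmail.com',
          'zebra.crossing@example.org']
counts = neighbour_counts(sample)
print(counts)
```

Addresses with a high count sit in a dense cluster of near-duplicates; an address with a count of 0 has no close neighbour in the data. Where to draw the line between the two is a judgment call for the data at hand.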

【Comments】:

【Solution 7】:

First, take a look at the regex question here

Second, try filtering the email addresses like this:

import re

# Say the email is 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
# split on '@' (not ',') to get the part before the domain
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
# trailing digits are the suspicious case here, so a match means BAD
if m is None:
    print ('%s is good' % email)
else:
    print ('%s is BAD' % email)

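【Comments】:

Thanks! I actually looked at that regex question before posting mine; I felt the answers were not flexible enough for all the variations I am facing, but thank you very much for the reference; I will test your solution now; how would this distinguish a legitimate email like john1988@gmail.com from attn12, attn123, attn1234, etc.?
How many unique addresses are you dealing with? If not too many, then I think you could combine two approaches: define a list() of the (unique) emails already in use and, after the regex catches those ending in digits, check them the way @dman suggests. john1988@gmail.com is still a problem though, because john1989@gmail.com looks valid too.
About 100,000 emails at a time, different every month - I don't think it is possible to define a list that, if I understand correctly, applies every month
@jeangelj You can define and populate that list at runtime. Start with it empty, then append the "good" addresses to it. In each filtering pass, drop the emails that contain lots of digits, then use fuzz.partial_ratio() against every element of that list as a fine filter to detect fraud. If the ratio indicates an email address is "good", append it to the list, so that it holds the known-good addresses. (This may not be very fast when the list is large, though.)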
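
The runtime-populated list described in the last comment might be sketched as follows. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for `fuzz.partial_ratio()` (different scale and measure), and the function name `filter_with_known_good` and the `max_digits` and `threshold` values are made up.

```python
import re
from difflib import SequenceMatcher


def filter_with_known_good(emails, max_digits=4, threshold=0.8):
    """Split addresses into (good, rejected) using a list built up at runtime."""
    good, rejected = [], []
    for email in emails:
        local = email.rsplit('@', 1)[0]
        # coarse filter: drop addresses with lots of digits
        if len(re.findall(r'\d', local)) > max_digits:
            rejected.append(email)
            continue
        # fine filter: compare against every address accepted so far
        too_similar = any(
            SequenceMatcher(None, local, g.rsplit('@', 1)[0]).ratio() > threshold
            for g in good
        )
        if too_similar:
            rejected.append(email)
        else:
            good.append(email)  # becomes a known-good address for later checks
    return good, rejected


sample = ['attn1@gmail.com', 'attn12@gmail.com',
          'attn12345678@gmail.com', 'john1988@gmail.com']
good, rejected = filter_with_known_good(sample)
print(good)
print(rejected)
```

The result depends on the order of the input: the first address of a series is accepted and only the later look-alikes are rejected. As the comment warns, the inner comparison grows with the size of the list, so this may be slow for around 100,000 addresses.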