Regex Search in Multiple PyMongo Document Fields: Answers

【Question Title】: Regex Search in Multiple PyMongo Document Fields
【Posted】: 2016-12-20 18:12:51
【Question】:

Goal

Search the Enron email corpus for emails to and from Ken Lay, securities fraudster extraordinaire.

Data

The data is a collection of more than 500,000 emails named workdocs. One such document:

 {'headers': {'To': 'eric.bass@enron.com', 'Subject': 'Re: Plays and other information', 'X-cc': '', 'X-To': 'Eric Bass', 'Date': 'Tue, 14 Nov 2000 08:22:00 -0800 (PST)', 'Message-ID': '<6884142.1075854677416.JavaMail.evans@thyme>', 'From': 'michael.simmons@enron.com', 'X-From': 'Michael Simmons', 'X-bcc': ''}, 'subFolder': 'notes_inbox', 'mailbox': 'bass-e', '_id': ObjectId('4f16fc97d1e2d32371003e27'), 'body': "the scrimmage is still up in the air...\n\n\nwebb said that they didnt want to scrimmage...\n\nthe aggies  are scrimmaging each other... (the aggie teams practiced on \nSunday)\n\nwhen I called the aggie captains to see if we could use their field.... they \nsaid that it was tooo smalll for us to use...\n\n\nsounds like bullshit to me... but what can we do....\n\n\nanyway... we will have to do another practice Wed. night....    and I dont' \nknow where we can practice.... any suggestions...\n\n\nalso,  we still need one  more person..."}

The fields I'm interested in are {'To':...,'From':...,'X-cc':...,'X-bcc':...}, which are found inside the 'headers' field.

Implementation (and error)

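Searching the whole document for 'klay@enron' seems to work with workdocs.find({'$text':{'$search':'klay@enron.com'}}), but I'm interested in using a regex to capture the many possible email aliases. How can I find documents whose To, From, X-bcc, and X-cc fields match the regex ken_email (below)?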

from pymongo import MongoClient
import re
# Raw string so that the "\." escape reaches the regex engine unchanged.
re_email = r'^(K|Ken|Kenneth)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}$'
ken_email = re.compile(re_email, re.IGNORECASE)
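
For context, one common way to apply a pattern to specific nested fields in MongoDB is to combine `$or` with `$regex` on dotted paths such as `headers.To`. The sketch below is not part of the original post and only builds the filter document. The helper name `build_header_filter` is hypothetical, and the query has not been run against the Enron collection. Note that because the pattern is anchored with `^` and `$`, a header value that lists several recipients would not match as a whole.

```python
import re

# Pattern from the question, written as a raw string.
RE_EMAIL = r'^(K|Ken|Kenneth)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}$'

# The four header fields named in the question, as dotted paths.
HEADER_FIELDS = ['headers.To', 'headers.From', 'headers.X-cc', 'headers.X-bcc']


def build_header_filter(pattern, fields=HEADER_FIELDS):
    """Hypothetical helper: match `pattern` case-insensitively in any of `fields`."""
    return {'$or': [{field: {'$regex': pattern, '$options': 'i'}} for field in fields]}


query = build_header_filter(RE_EMAIL)

# Assumed usage, given a pymongo collection called `workdocs`:
# for doc in workdocs.find(query):
#     print(doc['headers'])
```

Building the filter as a plain dict keeps the field list in one place, so adding or dropping a header only means editing `HEADER_FIELDS`.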

【Comments】:

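I think what you need is Wildcard Text Indexes.
Not sure that will work for me. That index allows text search on all fields with string content. I'm only looking at the 4 specific fields mentioned above.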
'((?:K((en)?neth)?)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4})' ?

Tags: python regex mongodb pymongo

【Solution 1】:

To search only those four fields, you could use:

(?:to|from|x-b?cc)'\s*:\s*'K[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}

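This version drops the capture group around his name, which the match doesn't need. (If you need the name, extracting it after the regex has finished is faster.)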

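I'm also not convinced it's necessary to validate the email address. You're already looking at fields that should only contain email addresses. You can shorten the regex further: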

(?:to|from|x-b?cc)'\s*:\s*'K[A-Z0-9._%+-]*Lay

This has the added benefit of matching addresses such as klay123@example.com.
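
As a rough illustration of the difference between the two patterns, the sketch below runs both against the string form of a few header dicts. The sample addresses are made up, and `re.IGNORECASE` is an assumption (the patterns use lowercase key names while the keys in the data are capitalised).

```python
import re

# Both patterns are copied from the answer above; IGNORECASE is assumed.
FULL = re.compile(
    r"(?:to|from|x-b?cc)'\s*:\s*'K[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}",
    re.IGNORECASE,
)
SHORT = re.compile(r"(?:to|from|x-b?cc)'\s*:\s*'K[A-Z0-9._%+-]*Lay", re.IGNORECASE)

# Made-up header dicts, for illustration only.
samples = [
    {'To': 'kenneth.lay@enron.com', 'From': 'a@enron.com'},
    {'To': 'a@enron.com', 'From': 'klay123@example.com'},
    {'To': 'a@enron.com', 'From': 'b@enron.com'},
]

# Each entry is (full pattern matched, short pattern matched).
results = []
for headers in samples:
    text = str(headers)  # e.g. "{'To': 'kenneth.lay@enron.com', ...}"
    results.append((bool(FULL.search(text)), bool(SHORT.search(text))))

for row in results:
    print(row)
```

Only the shortened pattern matches the second sample, because the full pattern requires `@` immediately after `Lay`.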

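It isn't very efficient (especially on long strings), but there are ways to speed it up. The easiest is to strip the email body beforehand. (That may also help prevent false positives.) You can remove everything after the first }.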

Just for fun, here's a regex that matches it:

\}.*

Just replace the match with an empty string to remove it.
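
A minimal sketch of that stripping step, assuming the document is searched as its Python string form, that 'headers' is the first key, and that no header value itself contains a `}`. The document here is a shortened, made-up version of the sample in the question.

```python
import re

# Shortened, made-up document shaped like the sample in the question.
doc = {
    'headers': {
        'To': 'eric.bass@enron.com',
        'From': 'michael.simmons@enron.com',
        'X-cc': '',
        'X-bcc': '',
    },
    'subFolder': 'notes_inbox',
    'body': 'the scrimmage is still up in the air...',
}

text = str(doc)

# Remove the first "}" and everything after it.
# DOTALL lets "." cross newlines in case the text contains any.
headers_only = re.sub(r'\}.*', '', text, flags=re.DOTALL)

print(headers_only)
```

The header regex can then be run on `headers_only` instead of the full text, so the body is never scanned.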

【Comments】: