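【Posted】:2016-12-20 18:12:51
【Question】:
Goal
Search the Enron email corpus for emails to and from Ken Lay, securities fraudster extraordinaire.
Data
A collection of more than 500,000 email documents, named workdocs, with the following structure.
One such document: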
{'headers': {'To': 'eric.bass@enron.com', 'Subject': 'Re: Plays and other information', 'X-cc': '', 'X-To': 'Eric Bass', 'Date': 'Tue, 14 Nov 2000 08:22:00 -0800 (PST)', 'Message-ID': '<6884142.1075854677416.JavaMail.evans@thyme>', 'From': 'michael.simmons@enron.com', 'X-From': 'Michael Simmons', 'X-bcc': ''}, 'subFolder': 'notes_inbox', 'mailbox': 'bass-e', '_id': ObjectId('4f16fc97d1e2d32371003e27'), 'body': "the scrimmage is still up in the air...\n\n\nwebb said that they didnt want to scrimmage...\n\nthe aggies are scrimmaging each other... (the aggie teams practiced on \nSunday)\n\nwhen I called the aggie captains to see if we could use their field.... they \nsaid that it was tooo smalll for us to use...\n\n\nsounds like bullshit to me... but what can we do....\n\n\nanyway... we will have to do another practice Wed. night.... and I dont' \nknow where we can practice.... any suggestions...\n\n\nalso, we still need one more person..."}
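The fields I'm interested in are {'To':...,'From':...,'X-cc':...,'X-bcc':...}, which are found inside the 'headers' field.
Implementation (and errors)
Searching the whole document for 'klay@enron' seems to work with workdocs.find({'$text':{'$search':'klay@enron.com'}}), but I'm interested in using a regular expression to catch the many possible email aliases. How can I find the documents whose To, From, X-bcc, and X-cc fields match the regex ken_email (below)?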
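from pymongo import MongoClient
import re
# Raw string, so that \. reaches the regex engine unchanged
re_email = r'^(K|Ken|Kenneth)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}$'
# Case-insensitive, so the A-Z classes also cover lowercase addresses
ken_email = re.compile(re_email, re.IGNORECASE)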
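One possible way to express the question's goal, offered as a sketch rather than a confirmed answer: PyMongo accepts compiled re patterns as query values, and dot notation such as 'headers.To' reaches into the embedded headers document, so the four fields can be combined under $or. The helper names build_header_query and matches_locally are made up for illustration, and the find call is left as a comment because it assumes a live connection to the workdocs collection.

```python
import re

# The pattern from the question, written as a raw string.
re_email = r'^(K|Ken|Kenneth)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4}$'
ken_email = re.compile(re_email, re.IGNORECASE)

# The four header fields named in the question.
HEADER_FIELDS = ('To', 'From', 'X-cc', 'X-bcc')


def build_header_query(regex, fields=HEADER_FIELDS):
    """Return a $or filter that tests each nested header field against regex."""
    # Dot notation ('headers.To') addresses a key inside the 'headers' subdocument.
    return {'$or': [{'headers.' + name: regex} for name in fields]}


def matches_locally(doc, regex, fields=HEADER_FIELDS):
    """Pure-Python version of the same test for string-valued headers."""
    headers = doc.get('headers', {})
    return any(regex.search(headers.get(name) or '') for name in fields)


query = build_header_query(ken_email)

# With a live connection (assumed, not shown here) the filter would be used as:
#   for doc in workdocs.find(query): ...
```

Because the pattern is anchored with ^ and $, a header that holds several comma-separated recipients would not match as a whole; if such headers occur in the corpus, dropping the anchors may be worth trying.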
【Comments】:
-
I think what you need is Wildcard Text Indexes
-
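Not sure that works for me. That index enables text search across every field with string content. I'm only looking at the 4 specific fields mentioned above.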
-
'((?:K((en)?neth)?)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4})'?
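The pattern in the comment above has no ^ or $ anchors, so it can pick matching addresses out of a header value that lists several recipients. A small sketch of that behaviour, using a made-up header string (the sample addresses and the helper name find_ken_addresses are illustrative assumptions, not taken from the corpus):

```python
import re

# The unanchored pattern from the comment, written as a raw string.
comment_pattern = re.compile(
    r'((?:K((en)?neth)?)[A-Z0-9._%+-]*Lay@[A-Z0-9._%+-]+\.[A-Z]{2,4})',
    re.IGNORECASE,
)


def find_ken_addresses(header_value):
    """Return every substring of header_value that the comment's pattern matches."""
    # finditer + group(0) avoids the tuples that findall returns when a
    # pattern contains several capture groups.
    return [m.group(0) for m in comment_pattern.finditer(header_value)]


# Hypothetical multi-recipient header value, for illustration only.
sample = 'eric.bass@enron.com, Kenneth.Lay@enron.com, klay@enron.com'
print(find_ken_addresses(sample))
# prints ['Kenneth.Lay@enron.com', 'klay@enron.com']
```

One caveat that applies to both patterns in this thread: the open-ended character class between the first name and Lay means that an address such as a hypothetical kclay@example.com also matches.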
Tags: python regex mongodb pymongo