Parse email and get number from body

【Question title】: Parse email and get number from body
【Posted】: 2012-01-09 05:12:38
【Question description】:

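I want to extract the first number found in the body of an email. With the help of the email library I have extracted the mail body into a string. The problem is that before the actual plain-text body starts there is some information about encoding and so on, and that information contains digits. How can I reliably skip it, in a way that does not depend on which client created the email, and get only the first number from the actual content?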

If I do this:

match = re.search(r'\d+', string, re.MULTILINE)

it returns the first match in the encoding information (or whatever else precedes the text), not in the actual mail content.

OK, I have added a sample. This is what it looks like (I want to extract 123). But I guess it may look different when sent by other clients.

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Update: Now I am stuck on the iterators :-/ I really tried, but I don't get it. This code:

import email
import email.iterators

msg = email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    print(part)

outputs:

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Why doesn't it just output:

Junk 123 Junk

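【Question discussion】:

Obviously you need to give us a sample we can work with.
You're right, this is one way it can look...
Use body_line_iterator to skip the subpart headers. I'll add a concrete example to my answer.

Tags: python regex email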

【Solution 1】:

You probably want to use the iterators to skip the subpart headers.

http://docs.python.org/library/email.iterators.html#module-email.iterators

This example prints the body of every message subpart whose type is text/plain:

import email.iterators

for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    for body_line in email.iterators.body_line_iterator(part):
        print(body_line)
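To tie this back to the question, here is a minimal sketch that combines the two iterators with the asker's `\d+` search. The sample message, the `RAW_MESSAGE` name and the `first_number_in_plain_text` helper are made up for illustration, and the sketch assumes the raw string includes the top-level `Content-Type: multipart/...` header that declares the boundary.

```python
import email
import email.iterators
import re

# Hypothetical sample message; note that it includes the top-level headers.
RAW_MESSAGE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="14dae93404410f62f404b2e65e10"

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--
"""


def first_number_in_plain_text(raw_message):
    """Return the first run of digits in any text/plain part, or None."""
    msg = email.message_from_string(raw_message)
    for part in email.iterators.typed_subpart_iterator(msg, "text", "plain"):
        # body_line_iterator yields only payload lines, never part headers.
        for line in email.iterators.body_line_iterator(part):
            match = re.search(r"\d+", line)
            if match:
                return match.group(0)
    return None


print(first_number_in_plain_text(RAW_MESSAGE))  # prints: 123
```

Because the search only ever sees payload lines, digits in headers such as `charset=ISO-8859-1` cannot match. With its default arguments `body_line_iterator` does not undo base64 or quoted-printable transfer encodings, so a part encoded that way would need extra handling.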

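【Discussion】:

Oh, that looks like the right way to do it. I'll try it tomorrow evening and see whether it works :)
It took me a while to get it working because I was only fetching the body with imaplib. Thanks!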
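One plausible reading of the last comment: if only the body is fetched (for example with an IMAP `BODY[TEXT]` fetch rather than `RFC822`), the string handed to the parser lacks the header that declares the multipart boundary, so the parser treats everything as a single text/plain payload. That would also explain the output shown in the question. A small sketch, with a made-up sample and hypothetical names `BODY_ONLY` and `HEADERS`:

```python
import email

# Hypothetical body-only fetch: the MIME parts without the top-level headers.
BODY_ONLY = """\
--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10--
"""

# The header block that declares the boundary (assumed here for illustration).
HEADERS = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="14dae93404410f62f404b2e65e10"\n'
    "\n"
)

body_only = email.message_from_string(BODY_ONLY)
full = email.message_from_string(HEADERS + BODY_ONLY)

# Without headers the parser sees one plain-text blob, boundaries included.
print(body_only.is_multipart())  # False
# With headers the boundary is known and the parts are split out.
print(full.is_multipart())       # True
print(full.get_payload(0).get_content_type())
```

Fetching the whole message, or prepending the original headers before parsing, lets the iterators see the individual parts.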

【Solution 2】:

You can use this:

import re

# subject holds the raw message text
match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", subject)
if match:
    result = match.group(1)

Explanation:

"
Content-Type:    # Match the characters “Content-Type:” literally
.                # Match any single character that is not a line break character
   *?               # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
[\n\r]           # Match a single character present in the list below
                    # A line feed character
                    # A carriage return character
   +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\D               # Match a single character that is not a digit 0..9
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                # Match the regular expression below and capture its match into backreference number 1
   \d               # Match a single digit 0..9
      +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"

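【Discussion】:

I guess this will work for my example, but could it fail depending on which client sent the message?
@NiclasNilsson It works as long as there is a Content-Type: line followed by some line breaks, and no other digit sequence comes before the number. I would go with the other solution though :)
Yes, thanks. But maybe I'm just frustrated with the Python email library, so I'm simply using the regex anyway. This is a hobby project of mine.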
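The caveat about other digit sequences can be checked directly. The sketch below runs the pattern from Solution 2 on the part from the question and on a made-up variant in which the client adds a `Content-Transfer-Encoding: 7bit` header. The second sample and both variable names are assumptions for illustration, not output from a real client.

```python
import re

PATTERN = re.compile(r"Content-Type:.*?[\n\r]+\D*(\d+)")

# The part from the question: the body follows Content-Type directly.
simple_part = (
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
)

# Hypothetical part from another client that adds a header containing a digit.
part_with_extra_header = (
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Junk 123 Junk\n"
)

print(PATTERN.search(simple_part).group(1))             # 123
print(PATTERN.search(part_with_extra_header).group(1))  # 7
```

In the second case the capture comes from the header rather than the body, which is the kind of client-dependent failure the asker was worried about.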