【Question title】: Parsing email with Python
【Posted】: 2011-03-04 06:42:08
【Question】:

I am writing a Python script to process emails delivered by Procmail. As suggested in a related question, I am using the following Procmail configuration:

:0:
|$HOME/process_mail.py

My process_mail.py script receives an email on standard input that looks like this:

From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE

I am trying to parse the message like this:

>>> import email
>>> msg = email.message_from_string(full_message)

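I want to get message fields such as "From", "To" and "Subject". However, the message object does not contain any of these fields.

What am I doing wrong?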

【Comments】:

    Tags: python email parsing mime


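    【Answer 1】:

    Answering my own question.

    I found a bug in the code that builds the message. It was adding newline characters between some of the lines, which prevented the parser from working properly.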
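
To illustrate the failure mode described above, here is a minimal sketch. The `broken` string is a made-up sample, not the asker's actual data; it assumes the bug inserted an empty line inside the header block. The parser treats the first empty line as the end of the headers, so everything after it lands in the body.

```python
import email

# Hypothetical sample: a stray empty line sits between two header lines.
broken = (
    "Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)\n"
    "\n"
    "Subject: TEST 12\n"
    "From: Full Name <username@sender.com>\n"
    "\n"
    "ONE\n"
)
# Remove only the first stray empty line to get a well-formed message.
clean = broken.replace("\n\n", "\n", 1)

broken_msg = email.message_from_string(broken)
clean_msg = email.message_from_string(clean)

print(broken_msg.keys())      # ['Received']
print(broken_msg["Subject"])  # None
print(clean_msg["Subject"])   # TEST 12
```

In the broken case the Subject and From lines are still present, but only as part of `broken_msg.get_payload()`.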

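    【Comments】:

      【Answer 2】:

      It looks like your continuation lines are not prefixed with whitespace after the line break, which is illegal according to RFC 2822 §2.2.3: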

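      Each header field is logically a single line of characters comprising
      the field name, the colon, and the field body. For convenience
      however, and to deal with the 998/78 character limitations per line,
      the field body portion of a header field can be split into a multiple
      line representation; this is called "folding". The general rule is
      that wherever this standard allows for folding white space (not
      simply WSP characters), a CRLF may be inserted before any WSP. For
      example, the header field: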

          Subject: This is a test
      

      can be represented as:

          Subject: This
           is a test
      

      It should look like this:

      From hostname Tue Jun 15 21:43:30 2010
      Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
      Received: from mail-fx0-f44.google.com (209.85.161.44)
          by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
      Received: by fxm19 with SMTP id 19so170709fxm.3
          for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
      MIME-Version: 1.0
      Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
          Jun 2010 18:47:33 -0700 (PDT)
      Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
      Date: Tue, 15 Jun 2010 20:47:33 -0500
      Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
      Subject: TEST 12
      From: Full Name <username@sender.com>
      To: username@domain.com
      Content-Type: text/plain; charset=ISO-8859-1
      
      ONE
      TWO
      THREE
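
A small sketch of why the leading whitespace matters, using a shortened, made-up version of the message above. It assumes Python 3 and the default (compat32) parser policy. When the continuation line loses its indentation, the parser no longer recognises it as part of a header, stops reading headers there, and records a defect.

```python
import email

# Correctly folded: the continuation line starts with whitespace.
folded = (
    "Received: from mail-fx0-f44.google.com (209.85.161.44)\n"
    "    by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400\n"
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)
# Same text with the continuation line's leading whitespace stripped.
unfolded = folded.replace("\n    by", "\nby")

good = email.message_from_string(folded)
bad = email.message_from_string(unfolded)

print(good["Subject"])  # TEST 12
print(bad["Subject"])   # None
print(bad.defects)      # the parser notes the missing header/body separator
```

Checking `msg.defects` after parsing is a cheap way to notice this kind of damage in a script that runs unattended under Procmail.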
      

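      【Comments】:

      • So, to clarify: if the raw file contains Subject: This\r\n is a test, then email.message_from_string() should report the subject as This is a test (without the space). I have found that for one particular email with this kind of wrapping in an attachment name (Content-Disposition), the stray \r\n is not removed.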
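
One possible explanation for the leftover \r\n, sketched under the assumption of Python 3: with the default compat32 policy the parser returns header values with their folding intact, while `email.policy.default` unfolds them on access. Whether this is what the commenter ran into is a guess.

```python
import email
from email import policy

raw = "Subject: This\r\n is a test\r\n\r\nbody\r\n"

# Default (compat32) policy: the folded value is returned as stored.
legacy = email.message_from_string(raw)
# Modern policy: line breaks are removed when the header is fetched.
modern = email.message_from_string(raw, policy=policy.default)

print(repr(legacy["Subject"]))       # 'This\r\n is a test'
print(repr(str(modern["Subject"])))  # 'This is a test'
```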
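      【Answer 3】:

      You have to make sure the lines are not accidentally broken (as they are above, though it is hard to tell whether that is a copy-paste issue). With a complete message such as: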

      Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
      Received: from mail-fx0-f44.google.com (209.85.161.44) by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
      Received: by fxm19 with SMTP id 19so170709fxm.3 for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
      MIME-Version: 1.0
      Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
      Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
      Date: Tue, 15 Jun 2010 20:47:33 -0500
      Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
      Subject: TEST 12
      From: Full Name <username@sender.com>
      To: username@domain.com
      Content-Type: text/plain; charset=ISO-8859-1
      
      ONE
      TWO
      THREE
      

      then

      import email
      msg = email.message_from_string(msgtxt)
      print(msg['Subject'])  # print() is a function in Python 3
      

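      prints TEST 12, as desired.

      【Comments】:

      • How do I get the message body here?
      • If you really want the whole RFC 2822 message body, raw MIME structure included, parsing the message with Python is basically overkill; the body is everything after the first empty line. Usually, for modern messages, you want to parse the MIME structure and extract one or more body parts.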
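
Both approaches from the comment above can be sketched as follows. This assumes Python 3.6+ and a sample message with LF line endings (real mail often uses CRLF, in which case the separator is "\r\n\r\n"). The `raw` string is a shortened stand-in for the message in the question.

```python
import email
from email import policy

raw = (
    "Subject: TEST 12\n"
    "From: Full Name <username@sender.com>\n"
    "To: username@domain.com\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "ONE\nTWO\nTHREE\n"
)

# Raw approach: everything after the first empty line is the body.
_headers, _sep, raw_body = raw.partition("\n\n")

# MIME-aware approach: let the parser pick a text part and decode it.
msg = email.message_from_string(raw, policy=policy.default)
part = msg.get_body(preferencelist=("plain", "html"))
text = part.get_content() if part is not None else ""

print(raw_body)
print(text)
```

For this single-part message the two results are the same; for multipart mail only the second approach gives you a decoded text part rather than the raw MIME source.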