【Question title】:How can I parse an email header with Python?
【Posted】:2015-07-26 05:38:32
【Question】:

Here is a sample email header:

header = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user@example.com
Return-Path: <mt.kb.user@gmail.com>
Envelope-To: user@example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""

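The header is stored as a string. How can I parse it into a dictionary, with the header field names as keys and the field values as values?

I want a dictionary like this: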

header_dict = {
    'From': 'Media Temple user (mt.kb.user@gmail.com)',
    'Subject': 'article: A sample header',
    'Date': 'January 25, 2011 3:30:58 PM PDT',
    # ... and so on for the remaining fields
}

I have a list of the required fields:

header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']

The items in this list can serve as the keys of the dictionary.

【Comments】:

Tags: python dictionary text-processing email-headers python-textprocessing


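【Solution 1】:

Most of these answers seem to have overlooked the Python email parser, and their output is not quite right because the values keep a leading space. Also, the OP probably made a typo by including a leading newline in the header string, which has to be stripped for the email parser to work.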

from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)

Output (truncated):

>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': 'January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
 ...
 'Subject': 'article: A sample header',
 'To': 'user@example.com',
 'X-Spam-Level': '***',
 'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
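
For readers who want to run this without copying the full header, here is a minimal self-contained sketch of the same approach. The shortened `raw_header` string and the variable names are assumptions made for illustration; only `HeaderParser().parsestr` and `strip()` come from the answer itself.

```python
from email.parser import HeaderParser

# Shortened, illustrative header; the leading newline mimics the question's input.
raw_header = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
To: user@example.com
"""

# strip() removes the leading newline so the first line is read as a header.
message = HeaderParser().parsestr(raw_header.strip())
header_dict = dict(message)

for name, value in header_dict.items():
    # repr() makes it visible that the values carry no leading space.
    print(name, repr(value))
```

The colon inside the Subject value is left intact, because the parser only treats the first colon on a header line as the separator.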

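Duplicate header keys

Note that email headers can contain duplicate keys, as described in the Python documentation for email.message:

Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.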
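
The case handling described in that passage can be seen with a short sketch. The two-line header and the names `message` and `plain` are illustrative assumptions, not part of the original answer.

```python
from email.parser import HeaderParser

message = HeaderParser().parsestr(
    "Subject: article: A sample header\nTo: user@example.com"
)

# Lookups on the message object ignore case ...
subject = message["subject"]
# ... while the stored field names keep their original spelling and order.
names = message.keys()

# A plain dict built from the message no longer matches case-insensitively.
plain = dict(message)

print(subject)
print(names)
print("subject" in message, "subject" in plain)
```

This is one reason to keep the parsed message object around instead of converting it to a dict when header names may arrive in varying case.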

Converting the following email message to a Python dictionary keeps only the first Received value.

headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")

dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}

Use the get_all method to retrieve every value of a duplicated header:

headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
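
If a dictionary is still wanted, one possible way to keep the duplicates is to map each field name to a list of its values. The sketch below assumes shortened, made-up Received lines and a hypothetical variable `all_headers`; it relies only on `Message.items()`, which yields every header including repeats.

```python
from email.parser import HeaderParser

# Illustrative input with a repeated field name.
raw = (
    "Received: by mx.example.net with SMTP id abc\n"
    "Received: from mail.example.org by mx.example.net\n"
    "Subject: hello\n"
)
message = HeaderParser().parsestr(raw)

# Collect every value under its field name, preserving header order.
all_headers = {}
for name, value in message.items():
    all_headers.setdefault(name, []).append(value)

print(all_headers)
```

Note that this groups by the exact spelling of the name, so `Received` and `received` would end up under separate keys unless the names are normalized first.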

【Comments】:

    【Solution 2】:

    split will work for you:

    Demo:

    >>> result = {}
    >>> for i in header.split("\n"):
    ...    i = i.strip()
    ...    if i :
    ...       k, v = i.split(":", 1)
    ...       result[k] = v
    

    Output:

    >>> import pprint
    >>> pprint.pprint(result)
    {'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
     'Date': ' January 25, 2011 3:30:58 PM PDT',
     'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
     'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
     'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
     'Envelope-To': ' user@example.com',
     'From': ' Media Temple user (mt.kb.user@gmail.com)',
     'Message Body': ' **The email message body**',
     'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
     'Mime-Version': ' 1.0',
     'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
     'Return-Path': ' <mt.kb.user@gmail.com>',
     'Subject': ' article: A sample header',
     'To': ' user@example.com',
     'X-Spam-Level': ' ***',
     'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
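
Every value in the output above starts with a space, because the split happens directly at the colon. A small variation that trims it is sketched below; it swaps in `str.partition` for `split(":", 1)` and uses a shortened header, both of which are choices made here for illustration and are not part of the answer.

```python
header = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
"""

result = {}
for line in header.strip().splitlines():
    # partition() splits at the first colon only and never raises.
    key, sep, value = line.partition(":")
    if sep:  # ignore lines that contain no colon at all
        result[key.strip()] = value.strip()

print(result)
```

This stays a plain string split, so it does not handle folded (multi-line) header values or duplicate field names the way an email parser does.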
    

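    【Comments】:

    • You can use header.splitlines(), which also removes the newline characters.
    • @PadraicCunningham: Yes. It removes the trailing blank line but not the leading one, e.g. >>> s = """\n1\n2\n3\n""" >>> s.splitlines() ['', '1', '2', '3'] So it is better to strip before splitting. Right?
    • The leading newline probably isn't really there; that is just how the OP set up the input. """From: Media Temple user (mt.kb.user@gmail.com) would be the actual start of the string. Anyway, +1, you got the split right.
    • @PadraicCunningham: OK. Could you explain your code a bit more, or point to a link? A generator object is created, and then you build the dictionary from it.
    • Each line is split into a list, Subject: article: A sample header -> ["Subject", " article: A sample header"]. Try running dict([["Subject", " article: A sample header"]]) from the interpreter and you will see what happens; in my code the same thing happens with multiple sublists.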
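
The difference discussed in these comments can be checked directly in the interpreter. This is a small sketch using the same toy string as the comment; the variable names are illustrative.

```python
s = "\n1\n2\n3\n"

# splitlines() drops the trailing newline but keeps the leading empty line.
with_leading_blank = s.splitlines()

# Stripping first removes both the leading and the trailing newline.
stripped_first = s.strip().splitlines()

# split("\n") keeps an empty string at both ends.
plain_split = s.split("\n")

print(with_leading_blank)
print(stripped_first)
print(plain_split)
```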
    【Solution 3】:
    header = """From: Media Temple user (mt.kb.user@gmail.com)
    Subject: article: A sample header
    Date: January 25, 2011 3:30:58 PM PDT
    To: user@example.com
    Return-Path: <mt.kb.user@gmail.com>
    Envelope-To: user@example.com
    Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
    Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
    Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
    Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
    Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
    Mime-Version: 1.0
    Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
    X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
    X-Spam-Level: ***
    Message Body: **The email message body**
    """   
    

    Split the string into individual lines, then split each line once on ":":

    from pprint import pprint as pp
    pp(dict(line.split(":",1) for line in header.splitlines()))
    

    Output:

    {'Content-Type': ' multipart/alternative; '
                     'boundary="----=_Part_3927_12044027.1214951458678"',
     'Date': ' January 25, 2011 3:30:58 PM PDT',
     'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
     'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
                       's=gamma; '
                       'h=domainkey-signature:received:received:message-id:date:from:to '
                       ':subject:mime-version:content-type; '
                       'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
                       'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
                       'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
                       'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
     'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
                            'h=message-id:date:from:to:subject:mime-version:content-type; '
                            'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
                            '36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
                            '6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
     'Envelope-To': ' user@example.com',
     'From': ' Media Temple user (mt.kb.user@gmail.com)',
     'Message Body': ' **The email message body**',
     'Message-Id': ' '
                   '<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
     'Mime-Version': ' 1.0',
     'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
                 'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
                 '<mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for '
                 'user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
     'Return-Path': ' <mt.kb.user@gmail.com>',
     'Subject': ' article: A sample header',
     'To': ' user@example.com',
     'X-Spam-Level': ' ***',
     'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                      'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
    

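    line.split(":",1) makes sure we split only once on :, so any : inside a value is left alone. Each line becomes a sublist holding a key/value pair, so calling dict on those sublists builds the dictionary from the pairs.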
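
To make the effect of the second argument concrete, here is a short sketch comparing the two calls on the Date line from the question. The variable names and the two-item `pairs` list are illustrative only.

```python
line = "Date: January 25, 2011 3:30:58 PM PDT"

# Without a limit, the colons inside the time are split as well.
parts_all = line.split(":")

# With maxsplit=1, only the first colon separates key from value.
parts_once = line.split(":", 1)

# dict() accepts any iterable of two-item sequences.
pairs = [item.split(":", 1) for item in ["A: 1", "B: 2:3"]]
as_dict = dict(pairs)

print(parts_all)
print(parts_once)
print(as_dict)
```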

    【Comments】:

    • @VivekSable, that is probably because the OP has a newline before the first line; use header.splitlines()[1:]
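
The failure this comment refers to can be reproduced with a header that starts with a newline. The sketch below uses a made-up two-field header and, as an alternative to slicing with `[1:]`, filters out blank lines; that filter is a suggestion added here, not something proposed in the thread.

```python
header = "\nFrom: a@example.com\nTo: b@example.com\n"

try:
    dict(line.split(":", 1) for line in header.splitlines())
    failed = False
except ValueError:
    # The empty first line splits into [''], which is not a key/value pair.
    failed = True

# Skipping blank lines avoids the error wherever they occur.
result = dict(
    line.split(":", 1) for line in header.splitlines() if line.strip()
)

print(failed)
print(result)
```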
    【Solution 4】:

    You can split the string on newlines, then split each line once on ":":

    >>> my_header = {}
    >>> for x in header.strip().split("\n"):
    ...     x = x.split(":", 1)
    ...     my_header[x[0]] = x[1]
    ... 
    

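    【Comments】:

    • 'Date': 'January 25, 2011 3:30:58 PM PDT' Will this work with your code? After the split, x[0] is the key and x[1] is the value, so the result would be 'Date': ' January 25, 2011 3'
    • @VivekSable I hadn't noticed that date format; updated now :), thanks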