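【Posted】: 2016-05-02 19:41:34
【Question】:
I am using Python to read the Enron email dataset. The emails are stored in text files. I want to read a text file and extract only the "body" of each email. I don't care about anything else (FROM, TO, BCC, attachments, DATE, and so on). I only want the BODY part, and I want to store the bodies in a list. I tried the get_payload() function, but it still prints everything. How can I skip the rest and keep only the body?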
import email.parser
from email.parser import Parser
# Code to extract a particular section from raw emails.
parser = Parser()
with open("path of the file", "r") as f:
    text1 = f.read()
msg = email.message_from_string(text1)
# Same parse through Parser; renamed so it no longer shadows the email module.
parsed = parser.parsestr(text1)
if msg.is_multipart():
    for payload in msg.get_payload():
        print(payload.get_payload())
else:
    print(msg.get_payload())
A single file may contain multiple emails. Sample emails:
docID: 1
segmentNumber: 0
Body: I just checked with Carolyn on your invoicing for the conference. She
verified the 85K was processed.
##########################################################
docID: 2
segmentNumber: 0
Body: null
##########################################################
docID: 3
segmentNumber: 0
Body: In regard to the costs for the GAM conference, Karen told me the $ 6,695.97
figure was inclusive of all the items for the conference. However, after
speaking with Shweta, I found out this is not the case. The CDs are not
included in this figure.
The CD cost will be $2,011.50 + the cost of postage/handling (which is
currently being tabulated).
##########################################################
docID: 3
segmentNumber: 1
Body:
This is the original quote for this project and it did not include the
postage. As soon as I have the details from the vendor, I'll forward those to
you.
Please call me if you have any questions.
【Discussion】:
-
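Hmm, what you are showing is *not* the email format... In an email, the headers come first, each starting with a header name, then a blank line, and everything after that is the body. In particular, there is nothing like Body= or Body:. This is a custom format, so you should not try to use the email module on it; parse it directly instead.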
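A minimal sketch of what "parse it directly" could look like for the sample shown in the question. It rests on assumptions read off that sample rather than on any documented format: records are separated by a line consisting only of `#` characters, the text of each record starts at a line beginning with `Body:`, and the literal value `null` means the body is empty. The names `extract_bodies`, `SEPARATOR` and `BODY_MARKER` are made up for illustration.

```python
import re

# Assumption: a line made up only of '#' characters separates two records.
SEPARATOR = re.compile(r"^#+[ \t]*$", re.MULTILINE)
# Assumption: the body starts right after a line-initial "Body:" marker.
BODY_MARKER = re.compile(r"^Body:[ \t]*", re.MULTILINE)


def extract_bodies(text):
    """Return the body text of every record in `text`, in file order."""
    bodies = []
    for record in SEPARATOR.split(text):
        match = BODY_MARKER.search(record)
        if match is None:
            continue  # chunk without a Body field, e.g. trailing whitespace
        body = record[match.end():].strip()
        # Assumption: the literal string "null" stands for an empty body.
        bodies.append("" if body == "null" else body)
    return bodies
```

To use it on a file, read the whole file into a string and pass it in, for example `bodies = extract_bodies(open("path of the file").read())`. Empty bodies are kept as `""` so the list stays aligned with the records; filter them out afterwards if they are not wanted.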