How do I get just the text content from a multipart email? (Answers)

【Question title】: How do I get just the text content from a multipart email?
【Posted】: 2011-04-09 09:41:52
【Question】:

    #!/usr/bin/php -q
    <?php
    $savefile = "savehere.txt";
    $sf = fopen($savefile, 'a') or die("can't open file");
    ob_start();

    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    // handle email
    $lines = explode("\n", $email);

    // empty vars
    $from = "";
    $subject = "";
    $headers = "";
    $message = "";
    $splittingheaders = true;

    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";

            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
            if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
                $to = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";




        }

        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
    /* $headers is ONLY included in the result at the last section of my question here */
    fwrite($sf,"$message");
    ob_end_clean();
    fclose($sf);
    ?>

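This is an example of what I tried. The problem is that I get far too much in the file. Here is what gets written to the file (as you can see, I just sent it a bunch of junk):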

From xxxxxxxxxxxxx Tue Sep 07 16:26:51 2010
Received: from xxxxxxxxxxxxxxx ([xxxxxxxxxxx]:3184 helo=xxxxxxxxxxx)
    by xxxxxxxxxxxxx with esmtpa (Exim 4.69)
    (envelope-from <xxxxxxxxxxxxxxxx>)
    id 1Ot4kj-000115-SP
    for xxxxxxxxxxxxxxxxxxx; Tue, 07 Sep 2010 16:26:50 -0400
Message-ID: <EE3B7E26298140BE8700D9AE77CB339D@xxxxxxxxxxx>
From: "xxxxxxxxxxxxx" <xxxxxxxxxxxxxx>
To: <xxxxxxxxxxxxxxxxxxxxx>
Subject: stackoverflow is helping me
Date: Tue, 7 Sep 2010 16:26:46 -0400
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="----=_NextPart_000_0169_01CB4EA9.773DF5E0"
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 14.0.8089.726
X-MIMEOLE: Produced By Microsoft MimeOLE V14.0.8089.726

This is a multi-part message in MIME format.

------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/plain;
    charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

111
222
333
444
------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/html;
    charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3Dtext/html;charset=3Diso-8859-1 =
http-equiv=3DContent-Type>
<META name=3DGENERATOR content=3D"MSHTML 8.00.6001.18939"></HEAD>
<BODY style=3D"PADDING-LEFT: 10px; PADDING-RIGHT: 10px; PADDING-TOP: =
15px"=20
id=3DMailContainerBody leftMargin=3D0 topMargin=3D0 =
CanvasTabStop=3D"true"=20
name=3D"Compose message area">
<DIV><FONT face=3DCalibri>111</FONT></DIV>
<DIV><FONT face=3DCalibri>222</FONT></DIV>
<DIV><FONT face=3DCalibri>333</FONT></DIV>
<DIV><FONT face=3DCalibri>444</FONT></DIV></BODY></HTML>

------=_NextPart_000_0169_01CB4EA9.773DF5E0--

I found this while searching around, but I don't know how to implement it, where to insert it in my code, or whether it would even work.

preg_match("/boundary=\".*?\"/i", $headers, $boundary);
$boundaryfulltext = $boundary[0];

if ($boundaryfulltext!="")
{
$find = array("/boundary=\"/i", "/\"/i");
$boundarytext = preg_replace($find, "", $boundaryfulltext);
$splitmessage = explode("--" . $boundarytext, $message);
$fullmessage = ltrim($splitmessage[1]);
preg_match('/\n\n(.*)/is', $fullmessage, $splitmore);

if (substr(ltrim($splitmore[0]), 0, 2)=="--")
{
$actualmessage = $splitmore[0];
}
else
{
$actualmessage = ltrim($splitmore[0]);
}

}
else
{
$actualmessage = ltrim($message);
}

$clean = array("/\n--.*/is", "/=3D\n.*/s");
$cleanmessage = trim(preg_replace($clean, "", $actualmessage));

So, how can I get just the plain text area of the email into my file or script for further processing?

Thanks in advance. Stack Overflow is great!

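【Comments】:

Is this the complete email? It's missing the Content-Type: multipart/mixed header, which should specify what the boundary string is (the code you found needs it).
This is just the part of the email that gets saved to the file. It's the most stripped-down version I've been able to get using the first code sample.
The boundary header is important for parsing your email, because it specifies where each part of the email starts and ends. Without it, all you can do is guess, and you know what they say about assumptions... ;) For example, for the email you quoted, there should be a header like: Content-Type: multipart/mixed; boundary="----=_NextPart_000_0163_01CB4EA5.46466520"
Is the boundary the same across different PC-based email clients or the popular free email accounts?
I added the headers var to the file write and edited my question to add that info for you guys/gals...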

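Tags: php email-parsing

【Solution 1】:

You need to take four steps to isolate the plain text part of the email body:

1. Get the MIME boundary string

We can use a regular expression to search your headers (assuming they are in a separate variable, $headers):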

$matches = array();
preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);
$boundary = isset($matches[1]) ? $matches[1] : ""; // empty string when no boundary header was found

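The regular expression searches for the Content-Type header that contains the boundary string and captures the string in the first capture group. We then copy that capture group into the variable $boundary.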
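The regex above only matches a boundary written in double quotes. As a rough sketch (not part of the original answer), the lookup could be wrapped in a helper that also accepts an unquoted boundary token. The function name find_boundary is hypothetical, and the type declarations assume PHP 7.1 or later:

```php
<?php
// Hypothetical helper (not from the original answer): returns the MIME
// boundary found in a block of headers, or null if the message does not
// look like a multipart message.
function find_boundary(string $headers): ?string
{
    // Accept boundary="quoted value" as well as boundary=bare-token.
    // The s modifier lets .*? cross the line break of a folded header.
    $pattern = '#^Content-Type:\s*multipart/[^;]+;.*?boundary=(?:"([^"]+)"|([^\s;]+))#ims';
    if (preg_match($pattern, $headers, $m)) {
        // Group 1 holds the quoted form, group 2 the unquoted one.
        return $m[1] !== '' ? $m[1] : $m[2];
    }
    return null;
}
```

Returning null for the "no boundary" case makes it easy to fall back to treating the message as a single-part email.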

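2. Split the email body into segments

Once we have the boundary, we can split the body into its parts (in your message body, the boundary is prefixed by -- each time it appears). According to the MIME spec, everything before the first boundary should be ignored.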

$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop everything before the first boundary

This leaves us with an array containing all the segments, with everything before the first boundary dropped.
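Note that the closing delimiter is the boundary followed by another --, so a plain explode leaves a final segment that starts with "--". A minimal sketch of a splitter that drops both the preamble and that trailing piece might look like this (split_parts is a hypothetical name, not something from the answer):

```php
<?php
// Hypothetical helper: split a multipart body into its parts, dropping the
// preamble and everything from the closing "--boundary--" delimiter onwards.
function split_parts(string $body, string $boundary): array
{
    $segments = explode('--' . $boundary, $body);
    array_shift($segments); // preamble before the first boundary

    $parts = [];
    foreach ($segments as $segment) {
        // After the closing delimiter the remaining text starts with "--".
        if (strncmp($segment, '--', 2) === 0) {
            break;
        }
        // Remove only the line break that ends the boundary line itself.
        $parts[] = preg_replace('/^[ \t]*\r?\n/', '', $segment, 1);
    }
    return $parts;
}
```

Each returned part still begins with its own MIME headers, which the following steps deal with.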

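3. Determine which segment is the plain text

The plain text segment will have a Content-Type header with a MIME type of text/plain. We can now search the segments for the first one with that header: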

foreach ($email_segments as $segment)
{
  if (stristr($segment, "Content-Type: text/plain") !== false)
  {
    // We found the segment we're looking for!
  }
}

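Since what we're looking for is a constant, we can use stristr (which finds the first instance of a substring in a string, case-insensitively) instead of a regular expression. If the Content-Type header is found, we've got our segment.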
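One caveat with a plain substring search is that it also matches when the phrase merely appears inside a part's body. As a hedged sketch, assuming each part starts with its headers followed by a blank line, the check could be limited to the header block (find_plain_part is a hypothetical name):

```php
<?php
// Hypothetical helper: return the first part whose own headers declare
// text/plain, or null if no such part exists.
function find_plain_part(array $parts): ?string
{
    foreach ($parts as $part) {
        // Look only at the header block (everything before the first blank
        // line), so body text mentioning "Content-Type: text/plain" cannot
        // cause a false match.
        $split = preg_split('/\r?\n\r?\n/', $part, 2);
        $headers = $split[0];
        if (preg_match('#^Content-Type:\s*text/plain\b#im', $headers)) {
            return $part;
        }
    }
    return null;
}
```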

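4. Remove all headers from the segment

Now we need to remove the headers from the segment we found, since we only want the actual message content. Four MIME headers can appear here: Content-Type, as we saw earlier, plus Content-ID, Content-Disposition and Content-Transfer-Encoding. Per the spec each header line ends with \r\n, but mail piped into a script often arrives with bare \n line endings, and a long header may continue on following lines that start with whitespace (as the Content-Type header in your sample does). We can use that to determine where each header ends:

$text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r?\n(?:[ \t].*?\r?\n)*/is', "", $segment);

The s modifier at the end of the regex makes the dot match line breaks as well. .*? collects as few characters as possible (i.e. everything up to the first line break); the ? is the lazy modifier on .*. The trailing (?:[ \t].*?\r?\n)* group also removes any continuation lines, such as the charset="iso-8859-1" line in your sample.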

After this, $text will contain your email message content.
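The sample email declares Content-Transfer-Encoding: quoted-printable, so the extracted text may still contain sequences such as =3D or =20. The sketch below is an alternative to stripping headers with a regex and is not the method of the original answer: it separates headers from body at the first blank line and then undoes the transfer encoding. The name decode_part_body is hypothetical, and only quoted-printable and base64 are handled:

```php
<?php
// Hypothetical helper: return the decoded body of a single MIME part.
function decode_part_body(string $part): string
{
    // Headers and body are separated by the first blank line.
    $split   = preg_split('/\r?\n\r?\n/', $part, 2);
    $headers = $split[0];
    $body    = isset($split[1]) ? $split[1] : '';

    // Undo the transfer encoding named in the part's headers, if any.
    if (preg_match('/^Content-Transfer-Encoding:\s*([\w-]+)/im', $headers, $m)) {
        $encoding = strtolower($m[1]);
        if ($encoding === 'quoted-printable') {
            $body = quoted_printable_decode($body);
        } elseif ($encoding === 'base64') {
            $body = (string) base64_decode($body);
        }
    }
    return trim($body);
}
```

The result is still in the part's declared charset (iso-8859-1 in the sample), so a conversion step may be needed if the rest of your script expects UTF-8.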

So, putting it all together with your code:

<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
    $email .= fread($fd, 1024);
}
fclose($fd);

$matches = array();
preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $email, $matches);
$boundary = isset($matches[1]) ? $matches[1] : ""; // empty string when no boundary header was found

$text = "";
if (isset($boundary) && !empty($boundary)) // did we find a boundary?
{
  $email_segments = explode('--' . $boundary, $email);

  foreach ($email_segments as $segment)
  {
    if (stristr($segment, "Content-Type: text/plain") !== false)
    {
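      // \r?\n accepts both CRLF and bare LF line endings; the trailing group
      // also removes folded continuation lines (e.g. the charset=... line).
      $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r?\n(?:[ \t].*?\r?\n)*/is', "", $segment));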
      break;
    }
  }
}

// At this point, $text will either contain your plain text body,
// or be an empty string if a plain text body couldn't be found.

$savefile = "savehere.txt";
$sf = fopen($savefile, 'a') or die("can't open file");
fwrite($sf, $text);
fclose($sf);
?>

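【Comments】:

I'm starting to get it, I think... So, to test, would I replace everything after //empty vars???
Not quite. It depends on what you want to do (for example, you might want to keep splitting the headers or collecting the "special" headers). My code expects you to have one block of text for the headers and one for the message, but you can replace $headers and $message in my code with $email, which according to your code should contain the entire email.
Argh, I don't understand! How would I implement this in the code sample above so I can test it? Would I put your snippet before the file write, and then write $text instead of $message? Thank you so much for your help and patience with this beginner.
I updated my code to read in the email (per your code) and process it. My code snippet should work the way you want without any modification. If you want to do anything else with the email, I'll leave that to you (or you can ask another question here for further help).
Old post, but I thought I'd add a quick update about a bug I found. In step 3, I found that the regex didn't match the multipart headers because they aren't always followed by a carriage return. If you remove the '\r' from that preg, I believe it works in all cases (since if there is one, it will be captured by the '.*?'). So the new one looks like $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\n/is', "", $segment));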

【Solution 2】:

There is an answer here:

You only need to change these two lines:

require_once('/path/to/class/rfc822_addresses.php');
require_once('/path/to/class/mime_parser.php');

【Comments】:

@james.garriss Not anymore (at the time of writing this comment)