PHP DOM UTF-8 problem: answers

【Question title】: PHP DOM UTF-8 problem
【Posted】: 2011-04-02 17:01:36
【Question description】:

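First of all, my database uses Windows-1250 as its native charset. I output the data as UTF-8. I use the iconv() function all over the site to convert Windows-1250 strings to UTF-8 strings, and that works fine.

The problem appears when I use PHP DOM to parse some HTML stored in the database (the HTML is the output of a WYSIWYG editor and is not a valid document: it has no html, head, or body tags, etc.).

The HTML can look something like this, for example: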

<p>Hello</p>

This is the method I use to parse a piece of HTML from the database:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

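The output of the method above is garbage: all special characters are replaced with weird sequences such as ÃšÄ�.

One more thing: it does work on my development server.

But it does not work on the production server.

Any suggestions?

PHP version on the production server: PHP Version 5.2.0RC4-dev

PHP version on the development server: PHP Version 5.2.13

UPDATE:

I am working on a solution myself. I got the idea from this PHP bug report (not really a bug, though): http://bugs.php.net/bug.php?id=32547

This is the solution I came up with. I will try it tomorrow and see whether it works: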

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

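【Question comments】:

Are you sure you are sending an appropriate Content-type header? I.e., if you open the page in Firefox, check whether View->Charset Encoding is set to UTF8.
Have you tried the save method: $doc->save();
@Cem I'll try it. Give me a few minutes.

Tags: php utf-8 domdocument iconv

【Solution 1】: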

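Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file to UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

Takes the charset specified in the meta http-equiv element for "content-type".
Otherwise assumes ISO-8859-1.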
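
The two rules above can be checked with a short script. This is only a sketch, assuming the script file is saved as UTF-8 and that the installed libxml2 behaves as the answer describes; the variable names are illustrative and do not come from the original post.

```php
<?php
// Assumption: this file is saved as UTF-8, so $fragment holds UTF-8 bytes.
$fragment = '<p>é</p>';
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

libxml_use_internal_errors(true); // keep parser warnings out of the output

// No charset declared: the parser falls back to a single-byte charset.
$plain = new DOMDocument();
$plain->loadHTML($fragment);
$withoutMeta = $plain->getElementsByTagName('p')->item(0)->textContent;

// Charset declared through meta http-equiv: the bytes are read as UTF-8.
$declared = new DOMDocument();
$declared->loadHTML($meta . $fragment);
$withMeta = $declared->getElementsByTagName('p')->item(0)->textContent;

// Compare each result with the original character.
var_dump($withoutMeta === 'é');
var_dump($withMeta === 'é');
```

On a setup that matches the description above, only the second comparison should succeed, because the first document reads each UTF-8 byte as a separate character.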

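I suggest you instead convert from Windows-1250 to ISO-8859-1 and prepend nothing.

EDIT: That suggestion is not very good, because Windows-1250 contains characters that do not exist in ISO-8859-1. Since you are dealing with fragments that have no meta content-type element, you can add your own to force interpretation as UTF-8: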

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend a meta element to force UTF-8 interpretation, and convert the fragment into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
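
The same idea can be wrapped in a small helper that returns the processed fragment without splitting the serialized string on <body>. This is a sketch only: the function name parseFragmentUtf8 is hypothetical, and passing a node to saveHTML() assumes PHP 5.3.6 or later, which is newer than both servers mentioned in the question.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): parse a UTF-8 HTML
 * fragment and return the markup of everything inside <body>.
 */
function parseFragmentUtf8($fragment)
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was placed in <body>
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes a single node (PHP >= 5.3.6).
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Assumption: this file is saved as UTF-8.
echo parseFragmentUtf8('<p>ĄĽź</p>'), "\n";
```

The prepended meta element ends up in the head, so it does not appear in the returned markup. Any attribute rewriting, such as the img and embed loops from the question, would go between loadHTML() and the serialization loop.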

【Comments】:

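If you had worked with non-English data (cp1250 or other), you would know that this hack is sometimes the only way to make PHP DOM preserve UTF-8 special characters. It is also mentioned in the PHP documentation. You could try creating a cp1250 database, fetching some data from it, and parsing it with PHP DOM. It is a real pain.
@Rich "It is also mentioned in the PHP documentation." Please link. User notes are not part of the documentation.
@Artefacto Here is the user comment (php.net/manual/en/domdocument.loadhtml.php). It is the third comment from the top. I know it is not official, but sometimes it is the only way. This is not the only time the Windows-1250 + PHP DOM combination has given me a headache. Anyway, I just slept for a while and got an idea for how to solve this (not sure whether it will work, though). I will try it tomorrow, and if it doesn't work, I may start a bounty on this question.
@Artefacto I have an idea of what the problem might be from here: bugs.php.net/bug.php?id=32547 But I'm going to sleep now.
@Artefacto If I solve this, I might add a comment to the PHP documentation for the first time :D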

【Solution 2】:

Two solutions.

You can set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

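EDIT: If both of these are set correctly, do the following:

Create a small page that contains UTF-8 characters.
Write the page out using the method you already have.
Use Fiddler or Wireshark to inspect the raw bytes transferred in your DEV and PROD environments. You can also use Fiddler/Wireshark to double-check the headers.

If you are confident that the correct headers are being sent, your best chance of finding the error is to start looking at the raw bytes. The same bytes sent to the same browser will produce the same result, so you need to find out why the bytes differ. Fiddler/Wireshark will help with that.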
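
When no network sniffer is at hand, a first look at the raw bytes can also be taken inside PHP itself. The sketch below is an illustration rather than part of the original answer: the helper name hexBytes is made up, and the sample only simulates a string that was read as ISO-8859-1 and encoded to UTF-8 a second time.

```php
<?php
// Hypothetical helper: show a string as space-separated hex bytes.
function hexBytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumption: this file is saved as UTF-8.
$good = 'é'; // one character, two bytes in UTF-8

// Simulate double encoding: treat the UTF-8 bytes as ISO-8859-1 and
// convert them to UTF-8 again.
$double = iconv('ISO-8859-1', 'UTF-8', $good);

echo hexBytes($good), "\n";   // c3 a9
echo hexBytes($double), "\n"; // c3 83 c2 a9
```

Dumping the same string this way on DEV and PROD, before and after the DOM step, should show at which stage the bytes start to differ.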

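【Comments】:

I don't think this will solve the problem, if it really works with var_dump.
He mentioned that it does work on his dev server, which means the bytes are most likely being written correctly. From there, the most probable problem is that the bytes are not being read correctly, which this should address.
The headers are sent correctly. The meta tag is correct as well.
OK, I'll try Fiddler. By the way, I think the problem is caused by PHP DOM. I think it messes up Eastern European UTF-8 characters. Do you know of an alternative to PHP DOM that I could use to parse HTML?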

【Solution 3】:

I had the same problem. My fix was to use Notepad++ and set the encoding of the PHP document to "UTF-8 without BOM". Hope this helps someone else.
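
Whether a PHP file starts with a UTF-8 byte order mark can also be checked from code instead of an editor. A minimal sketch, assuming the hypothetical helper name hasUtf8Bom; it only inspects a string, and reading the file is left to the caller:

```php
<?php
// Hypothetical helper: true if the string starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom($contents)
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBF<?php echo 1;")); // bool(true)
var_dump(hasUtf8Bom("<?php echo 1;"));             // bool(false)
```

For a real file, pass the result of file_get_contents() for that path to the helper.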
