【Question title】: DOMDocument::loadHTML() is converting htmlentities
【Posted】: 2026-01-19 05:15:02
【Question】:

A related question is "Preventing DOMDocument::loadHTML() from converting entities", but it did not produce a solution.

This code:

$html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
$doc = new DOMDocument;
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadhtml($html);
foreach ($doc->getElementsByTagName('span') as $node)
{
    var_dump($node->nodeValue);
    var_dump(htmlentities($node->nodeValue));
    var_dump(htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)));
}

produces this output:

string(16) "🆃🅴🆂🆃"
string(16) "🆃🅴🆂🆃"
string(0) ""

But what I want is &#x1F183;&#x1F174;&#x1F182;&#x1F183;

I am running PHP 5.6.29, and ini_get("default_charset") returns UTF-8.

【Discussion】:

    Tags: php xml domdocument html-entities php-5.6


    【Solution 1】:

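    After reading more about htmlentities at http://php.net/manual/en/function.htmlentities.php, I noticed that it does not encode every Unicode character. Someone posted a superentities function in the comments there, but it did not seem to work for me. The UTF8entities function did.

    Below are the two functions I adapted from the comments section, together with my code. It is not exactly what I wanted, but it does give me the HTML-encoded values.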
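To illustrate the point about htmlentities, here is a minimal sketch. It assumes UTF-8 input and the default HTML 4.01 entity table; the variable names are only for illustration. A character with a named entity is converted, while U+1F183, which has no named entity, passes through as raw bytes.

```php
<?php
// "é" (U+00E9), a space, then U+1F183, all written as raw UTF-8 bytes
// so the snippet does not depend on the file's own encoding.
$input = "\xC3\xA9 \xF0\x9F\x86\x83";

$encoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

// "é" becomes a named entity; the last four bytes (U+1F183) are left untouched.
$tailUnchanged = substr($encoded, -4) === "\xF0\x9F\x86\x83";

var_dump($encoded, $tailUnchanged);
```

This is why a separate step is needed to turn the remaining non-ASCII characters into numeric references.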

    $html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
    $doc = new DOMDocument;
    $doc->resolveExternals = false;
    $doc->substituteEntities = false;
    $doc->loadhtml($html);
    foreach ($doc->getElementsByTagName('span') as $node)
    {
        var_dump(UTF8entities($node->nodeValue));
    }
    
    
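    function UTF8entities($content = "") {
        // Split the string into an array of individual (possibly multi-byte) UTF-8 characters.
        $characterArray = preg_split('/(?<!^)(?!$)/u', $content);
        $rv = "";  // initialize the result to avoid an undefined-variable notice
        foreach ($characterArray as $character) {
            $rv .= unicode_entity_replace($character);
        }
        return $rv;
    }
    
    // Convert one UTF-8 character to a decimal numeric character reference (m. perez).
    function unicode_entity_replace($c) {
        if ($c === "") {
            return $c;  // nothing to convert
        }
        $h = ord($c[0]);  // lead byte; [] is used because the {} offset syntax was removed in PHP 8
        if ($h < 0xC2) {
            return $c;  // ASCII, or a byte that cannot start a multi-byte sequence
        }
    
        if ($h <= 0xDF) {
            // 2-byte sequence: 5 payload bits in the lead byte
            $h = ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
        } else if ($h <= 0xEF) {
            // 3-byte sequence: 4 payload bits in the lead byte
            $h = ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6 | (ord($c[2]) & 0x3F);
        } else if ($h <= 0xF4) {
            // 4-byte sequence: 3 payload bits in the lead byte
            $h = ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12 | (ord($c[2]) & 0x3F) << 6 | (ord($c[3]) & 0x3F);
        } else {
            return $c;  // not a valid UTF-8 lead byte
        }
        return "&#" . $h . ";";
    }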
    

    This returns:

    string(36) "&#127363;&#127348;&#127362;&#127363;"
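The result above uses decimal references, while the question asked for the hexadecimal form. A minimal sketch of one way to get that form follows, assuming the input is valid UTF-8. The function name utf8_to_hex_entities is hypothetical and not part of PHP or of the answer above. It uses the same bit arithmetic as unicode_entity_replace, but formats the code point with sprintf('%X').

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a valid UTF-8
// string with a hexadecimal numeric character reference such as &#x1F183;.
function utf8_to_hex_entities($text) {
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        // Bytes of the single matched character, re-indexed from 0.
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);

        // Payload mask of the lead byte for 2-, 3- and 4-byte sequences.
        $masks = array(2 => 0x1F, 3 => 0x0F, 4 => 0x07);
        $cp = $bytes[0] & $masks[$n];

        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        return sprintf('&#x%X;', $cp);
    }, $text);
}

// The four characters from the question, written as raw UTF-8 bytes.
$sample = "\xF0\x9F\x86\x83\xF0\x9F\x85\xB4\xF0\x9F\x86\x82\xF0\x9F\x86\x83";
echo utf8_to_hex_entities($sample), "\n";  // prints &#x1F183;&#x1F174;&#x1F182;&#x1F183;
```

Because the pattern uses the /u modifier, preg_replace_callback returns null on invalid UTF-8, so callers may want to check for that before using the result.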

    【Discussion】: