I see the shortcoming of my previous answer. Here is a workaround that preserves the tags inside <pre> tags:
<?php
$test = file_get_contents('input.html');
$dom = new DOMDocument('1.0');
$dom->loadHTML($test);
$xpath = new DOMXpath($dom);
$pre = $xpath->query('//pre//text()');
// manipulate nodes of type XML_TEXT_NODE
foreach ($pre as $e) {
    $e->nodeValue = str_replace(' ', '__REPLACEMELATER__', $e->nodeValue);
    // when you attempt to write &nbsp; in a dom node
    // the & will be converted to &amp; :(
}
$temp = $dom->saveHTML();
$temp = str_replace('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '', $temp);
$temp = str_replace('<html>', '', $temp);
$temp = str_replace('<body>', '', $temp);
$temp = str_replace('</body>', '', $temp);
$temp = str_replace('</html>', '', $temp);
$temp = str_replace('__REPLACEMELATER__', '&nbsp;', $temp);
echo $temp;
?>
Input
<p>paragraph 1 remains untouched</p>
<pre>preformatted 1</pre>
<div>
<pre>preformatted 2</pre>
</div>
<div>
<pre>preformatted 3 <span class="foo">span text</span> preformatted 3</pre>
</div>
<div>
<pre>preformatted 4 <span class="foo">span <b class="bla">bold test</b> text</span> preformatted 3</pre>
</div>
Output
<p>paragraph 1 remains untouched</p>
<pre>preformatted&nbsp;1</pre>
<div>
<pre>preformatted&nbsp;2</pre>
</div>
<div>
<pre>preformatted&nbsp;3&nbsp;<span class="foo">span&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
<div>
<pre>preformatted&nbsp;4&nbsp;<span class="foo">span&nbsp;<b class="bla">bold&nbsp;test</b>&nbsp;text</span>&nbsp;preformatted&nbsp;3</pre>
</div>
Note #1
In PHP >= 5.3.6, the DOMDocument::saveHTML() method lets you specify which node to output. Otherwise, you can use str_replace() or preg_replace() to strip the doctype, html and body tags.
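As a rough illustration of the first option, here is a minimal sketch that serializes only the children of <body>, so no doctype/html/body stripping is needed afterwards. The sample markup and the variable names ($html, $body, $out) are made up for the example and are not part of the code above.

```php
<?php
// Sketch only: the sample markup and variable names are illustrative.
$html = '<p>one</p><pre>two three</pre>';

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>; fetch the body element.
$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of <body> on its own (needs PHP >= 5.3.6),
// which leaves out the doctype, <html> and <body> wrappers.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}

echo $out;
```

The placeholder replacement from the main script would still be applied to $out in the same way as to $temp.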
Note #2
This trick seems to work and saves one line of code, but I am not sure whether it is guaranteed to work:
$e->nodeValue = utf8_encode(str_replace(' ', "\xA0", $e->nodeValue));
// dom library will attempt to convert 0xA0 to &nbsp;
// nodeValue expects utf-8 encoded data but 0xA0 is not valid in this encoding
// hence replaced string must be utf-8 encoded
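A possible variation on this trick, offered as a sketch and not as a guaranteed fix: write the no-break space as its UTF-8 byte sequence ("\xC2\xA0") so that utf8_encode() is not needed at all. This sidesteps two issues: utf8_encode() is deprecated as of PHP 8.2, and it treats its input as ISO-8859-1, so any non-ASCII text already present in nodeValue would be re-encoded. Whether the serializer then emits &nbsp; or the raw character may depend on the libxml version and the document encoding, so the same caveat as above applies. The sample markup and variable names are illustrative.

```php
<?php
// Sketch only: the sample markup and variable names are illustrative.
$html = '<pre>a b <span>c d</span></pre>';

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//pre//text()') as $text) {
    // "\xC2\xA0" is U+00A0 (no-break space) already encoded as UTF-8,
    // so the value can be assigned to nodeValue without utf8_encode().
    $text->nodeValue = str_replace(' ', "\xC2\xA0", $text->nodeValue);
}

// Serialize only the <pre> element (PHP >= 5.3.6).
$pre = $dom->getElementsByTagName('pre')->item(0);
$result = $dom->saveHTML($pre);

echo $result;
```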