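I suggest using DOMDocument (with loadHTML) to load the HTML, remove every kind of tag and attribute you don't want to see, and save the result back to HTML (using saveXML or saveHTML). You can do this by recursively walking the children of the document root and replacing the tags you don't want with their inner contents. Since loadHTML parses markup in much the same way browsers do, this is a lot safer than using regular expressions.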
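To illustrate the "parses like a browser" point, here is a minimal sketch; the malformed input string is made up for the example, and the call to libxml_use_internal_errors is my own addition to keep parser warnings quiet. It feeds loadHTML an unclosed, upper-case tag and inspects the tree that comes out.

```php
<?php
// Made-up malformed input: upper-case tag name, unclosed <b> and <P>.
$html = '<P>Unclosed <b>bold';

// Assumption: we expect bad markup, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The parser builds a normal tree: tag names are lower-cased and open elements get closed.
$p = $doc->getElementsByTagName('p')->item(0);
echo $p->nodeName, "\n";
echo $doc->saveHTML($p), "\n";
```

Because the filtering works on this parsed tree rather than on the raw string, tricks that rely on odd casing or unbalanced tags have much less room to slip through.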
EDIT: Here's a "purify" function I wrote:
<?php
function purifyNode($node, $whitelist)
{
    $children = array();
    // copy childNodes since we're going to iterate over it and modify the collection
    foreach ($node->childNodes as $child)
        $children[] = $child;

    foreach ($children as $child)
    {
        if ($child->nodeType == XML_ELEMENT_NODE)
        {
            purifyNode($child, $whitelist);
            if (!isset($whitelist[strtolower($child->nodeName)]))
            {
                // tag is not whitelisted: move its children up, then drop the tag itself
                while ($child->childNodes->length > 0)
                    $node->insertBefore($child->firstChild, $child);
                $node->removeChild($child);
            }
            else
            {
                $attributes = $whitelist[strtolower($child->nodeName)];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute)
                    $childAttributes[] = $attribute;

                foreach ($childAttributes as $attribute)
                {
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value))
                        $child->removeAttribute($attribute->name);
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes())
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));

    purifyNode($doc->documentElement, $whitelist);
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}
?>
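You call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that are not in the whitelist are stripped, and their contents are inlined into the parent tag. Attributes that are not listed for a given tag are removed as well, as are attributes that are listed but whose value does not match the regex.
Here's an example: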
<?php
$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;
// whitelist format: tag => array(attribute => regex)
$whitelist = array(
    'b' => array(),
    'i' => array(),
    'u' => array(),
    'p' => array(),
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'),
    'a' => array('href' => '@\Ahttp://.+\Z@')
);
$purified = purifyHTML($html, $whitelist);
echo $purified;
?>
The result is:
<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
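Obviously you don't want to allow any on* attribute, and I would advise against allowing style because of weird proprietary properties such as behavior. Make sure every URL attribute is validated with a regex that matches the full string (\Aregex\Z).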
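To show why the full-string match matters, here is a small sketch. The attacker-style value and the variable names are invented for the illustration; the anchored pattern is the same one used for href in the whitelist above.

```php
<?php
// Made-up hostile value that merely *contains* an http:// URL.
$value = 'javascript:evil()//http://example.org/';

$unanchored = '@http://.+@';      // matches anywhere inside the string
$anchored   = '@\Ahttp://.+\Z@';  // must match from start to end

$unanchoredAccepts   = preg_match($unanchored, $value) === 1;
$anchoredAccepts     = preg_match($anchored, $value) === 1;
$anchoredAcceptsGood = preg_match($anchored, 'http://example.org/') === 1;

// Prints bool(true), bool(false), bool(true), one per line.
var_dump($unanchoredAccepts, $anchoredAccepts, $anchoredAcceptsGood);
```

With the unanchored pattern the javascript: URL would survive purification, because preg_match is satisfied by a match anywhere in the value; anchoring with \A and \Z forces the scheme check to apply to the start of the string.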