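How can I be 100% sure of finding JS inside an HTML tag? Answers

【Question title】: How can I be 100% sure of finding JS inside an HTML tag?
【Posted】: 2023-11-18 17:08:01
【Question description】:

I need to save some data with a few HTML tags, so I can't run strip_tags over the whole text, and I can't use htmlentities either, because the text must remain formatted by tags. To protect other users from XSS attacks, I have to remove all JavaScript from inside the tags.

What is the best way to do this?

【Discussion】: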

*.com/questions/1886740/php-remove-javascript
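If you want to do the filtering in JavaScript, a similar question has already been asked at *.com/questions/295566/….

Tags: php javascript html security xss

【Solution 1】:

If you need to save HTML markup in the database and later print it back to the browser, there is no 100% safe way to do that using only PHP's built-in functions. It is easy when there are no HTML tags: with plain text, the built-in PHP functions are enough to clean it.

There are functions that clean XSS out of text, but they are not 100% safe, and there is always some way for XSS to slip through unnoticed. Your regex example is fine, but what if I use < script>alert('xss')</script>, or some other combination that the regex misses and the browser executes?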
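To make the point about regex filters concrete, here is a minimal sketch. The function naiveStripScripts and its pattern are hypothetical, written only for illustration; they are not the regex from the question. The first input shows how removing inner tags can splice a new script tag together, and the second shows that an event handler needs no script tag at all.

```php
<?php
// Hypothetical regex-based filter, for illustration only (not from the question).
function naiveStripScripts(string $html): string
{
    // Removes every <script ...>...</script> pair it can see in a single pass.
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

// Removing the inner pairs splices the outer fragments into a new script tag.
$nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';
echo naiveStripScripts($nested), "\n";   // <script>alert(1)</script>

// There is no <script> tag here, so the filter leaves the event handler untouched.
$handler = '<img src="x" onerror="alert(1)">';
echo naiveStripScripts($handler), "\n";  // <img src="x" onerror="alert(1)">
```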

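The best approach is to use something like HTML Purifier.

Also note that there are two levels of security: the first when data goes into your database, the second when it comes out of it.

Hope this helps!

【Discussion】:

There is a 100% safe way to do this: use an HTML parser (an actual parser, not a regex-based one) together with a whitelist of tags and attributes. All Stack Exchange sites do this.
Didn't I link HTML Purifier in my answer? :) I said that using PHP's built-in functions or regular expressions is not 100% safe.
I was mostly addressing the first paragraph of your answer.
Oh, you are 100% right; reading it again, I see what you mean. I expressed myself badly. My mistake!
Thanks, I'll try HTML Purifier, but I can't find a simple written example anywhere, something like $safer_text = function($_POST['textarea'],$allowed_tags);. By the way, what does the allowed tags variable have to look like?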

【Solution 2】:

If you want to allow specific tags, you have to parse the HTML.

There is already a good library for this: HTML Purifier (open source, under the LGPL).
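As a rough sketch of what a call to HTML Purifier can look like: the helper name sanitizeUserHtml and the particular tag list are assumptions chosen for illustration, and the code assumes the library is already installed and autoloaded (for example through Composer). The allowed tags are given as a comma-separated string, with permitted attributes in square brackets.

```php
<?php
// Assumes HTML Purifier is installed and autoloaded, for example via
// Composer (package ezyang/htmlpurifier) with: require 'vendor/autoload.php';

// sanitizeUserHtml() is a hypothetical helper name, not part of the library.
function sanitizeUserHtml(string $dirtyHtml): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Whitelist of tags, with permitted attributes in square brackets.
    $config->set('HTML.Allowed', 'p,b,i,u,a[href],img[src|alt]');
    // Only accept http and https URLs, which excludes javascript: links.
    $config->set('URI.AllowedSchemes', array('http' => true, 'https' => true));

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}

// Usage (the form field name is hypothetical):
// $saferText = sanitizeUserHtml($_POST['textarea']);
```

Building the configuration once and reusing the purifier object is probably preferable in real code; the sketch rebuilds it on each call only to stay short.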

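【Discussion】:

【Solution 3】:

I suggest you use DOMDocument (with loadHTML) to load the HTML, remove every kind of tag and attribute you don't want to see, and save it back to HTML (using saveXML or saveHTML). You can do this by recursively walking the children of the document root and replacing the tags you don't want with their inner contents. Because loadHTML loads the markup in a way similar to how browsers do, this is a much safer approach than using regular expressions.

EDIT: Here is a "purify" function I made: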

<?php

function purifyNode($node, $whitelist)
{
    $children = array();
    // copy childNodes since we're going to iterate over it and modify the collection
    foreach ($node->childNodes as $child)
        $children[] = $child;

    foreach ($children as $child)
    {
        if ($child->nodeType == XML_ELEMENT_NODE)
        {
            purifyNode($child, $whitelist);
            if (!isset($whitelist[strtolower($child->nodeName)]))
            {
                while ($child->childNodes->length > 0)
                    $node->insertBefore($child->firstChild, $child);

                $node->removeChild($child);
            }
            else
            {
                $attributes = $whitelist[strtolower($child->nodeName)];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute)
                    $childAttributes[] = $attribute;

                foreach ($childAttributes as $attribute)
                {
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value))
                        $child->removeAttribute($attribute->name);
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes())
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));

    purifyNode($doc->documentElement, $whitelist);
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}

?>

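You call purifyHTML with the unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that are not in the whitelist are stripped, and their contents are inlined into the parent tag. Attributes that are not in the whitelist for a given tag are removed as well, as are attributes that are in the whitelist but do not match the regex.

Here is an example: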

<?php

$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;

// whitelist format: tag => array(attribute => regex)
$whitelist = array(
    'b' => array(),
    'i' => array(),
    'u' => array(),
    'p' => array(),
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'),
    'a' => array('href' => '@\Ahttp://.+\Z@')
);

$purified = purifyHTML($html, $whitelist);
echo $purified;

?>

The result is:

<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>

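Obviously, you don't want to allow any on* attribute, and I would advise against style because of weird proprietary properties like behavior. Make sure every URL attribute is validated with a regex that matches the full string (\Aregex\Z).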
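The reason for anchoring with \A and \Z can be shown with a short sketch. The helper isSafeUrl is hypothetical, and its pattern also accepts https, which the whitelist in the answer does not; both are assumptions made for illustration.

```php
<?php
// Hypothetical helper: true only if the whole value is an http(s) URL.
function isSafeUrl(string $value): bool
{
    return preg_match('@\Ahttps?://.+\Z@', $value) === 1;
}

$attack = 'javascript:evil()//http://example.org/';

// Unanchored: the pattern finds "http://example.org/" in the middle and accepts it.
var_dump(preg_match('@https?://.+@', $attack) === 1); // bool(true)

// Anchored: the value must start with the scheme, so the payload is rejected.
var_dump(isSafeUrl($attack));                         // bool(false)
var_dump(isSafeUrl('http://example.org/image.png'));  // bool(true)
```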

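【Discussion】:

Will it work with HTML fragments, or will it try to create a full document, with an <html> tag and so on?
@cHao, it will try to create a full document, but you only need to walk what is inside <body>. Besides, if you use the recursive approach and don't whitelist html and body, it should behave as if it were a fragment.
@Hogan, I would delete the answer if I could.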
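On the fragment question in the discussion above, one possible way to get only the contents of <body> back out of a DOMDocument is sketched here. The helper name bodyInnerHtml is hypothetical and is not part of the answer's code.

```php
<?php
// Hypothetical helper: returns the serialized children of <body> only.
function bodyInnerHtml(string $fragment): string
{
    if ($fragment === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Prints the two paragraphs without the surrounding <html> and <body> tags.
echo bodyInnerHtml('<p>Hello <b>world</b></p><p>Bye</p>'), "\n";
```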

【Solution 4】:

I wrote this code for that purpose; you can set the list of tags and attributes to remove.

function RemoveTagAttribute($Dom,$Name){
    $finder = new DomXPath($Dom);
    if(!is_array($Name))$Name=array($Name);
    foreach($Name as $Attribute){
        $Attribute=strtolower($Attribute);
        do{
          $tag=$finder->query("//*[@".$Attribute."]");
          //print_r($tag);
          foreach($tag as $T){
            if($T->hasAttribute($Attribute)){
               $T->removeAttribute($Attribute);
            }
          }
        }while($tag->length>0);  
    }
    return $Dom;

}
function RemoveTag($Dom,$Name){
    if(!is_array($Name))$Name=array($Name);
    foreach($Name as $tagName){
        $tagName=strtolower($tagName);
        do{
          $tag=$Dom->getElementsByTagName($tagName);
          //print_r($tag);
          foreach($tag as $T){
            //
            $T->parentNode->removeChild($T);
          }
        }while($tag->length>0);
    }
    return $Dom;

}

Example:

  $dom= new DOMDocument; 
   $HTML = str_replace("&", "&amp;", $HTML);  // disguise &s going IN to loadXML() 
  // $dom->substituteEntities = true;  // collapse &s going OUT to transformToXML() 
   $dom->recover = TRUE;
   @$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML); 
   // dirty fix
   foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
      $dom->removeChild($item); // remove hack
   $dom->encoding = 'UTF-8'; // insert proper
  $dom=RemoveTag($dom,"script");
  $dom=RemoveTagAttribute($dom,array("onmousedown","onclick"));
  echo $dom->saveHTML();
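The example above removes only the two handlers it names. Since Solution 3 advises against allowing any on* attribute, a variant that strips every attribute whose name starts with "on" might look like the sketch below. The function name removeEventHandlerAttributes is hypothetical, and this is still a blacklist, so it does nothing about javascript: URLs or unwanted tags.

```php
<?php
// Hypothetical helper: strips every attribute whose name starts with "on".
function removeEventHandlerAttributes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // translate() lowercases "O" and "N" so the prefix test ignores case.
    $query = "//@*[starts-with(translate(name(), 'ON', 'on'), 'on')]";
    // Copy the matches into an array before modifying the tree.
    $attributes = iterator_to_array($xpath->query($query));
    foreach ($attributes as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p onclick="evil()" onmouseover="evil()" class="note">Hello</p>');
removeEventHandlerAttributes($dom);
echo $dom->saveHTML(); // the class attribute survives, the on* attributes do not
```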

【Discussion】: