Parse HTML using DOMDocument in PHP: answers

【Question title】: parse HTML using DOMDocument in PHP
【Posted】: 2011-10-07 04:30:16
【Question description】:

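I want to use DOMDocument to parse a string that comes from a rich-text editor. What I need is exactly this:

1) Allow only the (div, p, span, b, ul, ol, li, blockquote, br) tags, and remove all other tags together with their content

Edit: I am currently using strip_tags() for this

2) Allow only the following styles:

style="font-weight:bold"
style="font-style:italic"
style="text-decoration:underline"

3) Remove all attributes such as class, id, etc. from the allowed tags, with the align attribute as the only exception

Any ideas?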

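【Comments on the question】:

Regarding your second point: what happens if an element has both bold and italic styles? And what if it is a <ul> or <li> element, since changing it to a <b> or <i> tag would change how it works? Finally, I would point out that the <u> tag is deprecated; using the text-decoration:underline style is recommended instead.
@Spudley Good point, I have edited the question to allow only these styles
See the question stackoverflow.com/questions/4979836/domdocument-in-php/…

Tags: php html dom html-parsing domdocument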

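【Solution 1】:

For security reasons, I would advise against trying to filter HTML input with DOMDocument yourself, especially given the risk of cross-site scripting. A filtering library such as HTML Purifier can easily satisfy requirements 1 and 3. Item 2 is somewhat harder, for the reasons Spudley mentioned. I would start by whitelisting those style attributes in HTML Purifier and then, after filtering, use some logic to scan for them and add the appropriate tags to that element.

Here is an example using HTML Purifier (taken from basic.php). The only things I changed are the HTML.AllowedAttributes and HTML.AllowedElements settings.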

<?php
// replace this with the path to the HTML Purifier library
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// configuration goes here:
$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype
$config->set('HTML.AllowedAttributes', '*.style, align');
$config->set('HTML.AllowedElements', 'div, p, span, b, ul, ol, li, blockquote, br');
$config->set('CSS.AllowedProperties', 'font-weight, font-style, text-decoration');


$purifier = new HTMLPurifier($config);

$html = '<div align="center" style="font-style:italic; color: red" title="removeme">Allowed</div> <img src="not_allowed.jpg" /> <script>not allowed</script>';

$filteredHtml = $purifier->purify($html);
echo '<pre>' . htmlspecialchars($filteredHtml) . '</pre>';

Which outputs:

<div align="center" style="font-style:italic;">Allowed</div>

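【Discussion】:

HTML Purifier ships with some basic sample code (docs/example/basic.php) to get you started, and there is plenty of documentation online.
Please post sample code that uses HTML Purifier, so I can be sure it meets my needs
@D3VELOPER: Added an example using your tags. You should be able to figure it out from there. If you want to change any other configuration, take a look at this.
But this allows all style rules; I only want to allow 3 style rules
Added it to the code snippet above. Please look at the documentation I linked; it is in the CSS section.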

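【Solution 2】:

Since you only want to allow a small number of HTML elements, you could consider cleaning the HTML with PHP's strip_tags() function before handing it to the DOMDocument class.

That is certainly easier than parsing the DOM yourself to find the elements that need stripping.

This should take care of part 1 of your question.

It does not address parts 2 or 3, but it is a good start.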
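A minimal sketch of that first step, assuming the tag list from the question; the sample input string is made up for illustration. Note that strip_tags() removes the disallowed tags themselves but keeps the text between them, and it leaves attributes on the allowed tags untouched, so on its own it covers only part of requirement 1.

```php
<?php
// Tags the question wants to keep, in the format strip_tags() expects.
$allowed = '<div><p><span><b><ul><ol><li><blockquote><br>';

// Hypothetical input from a rich-text editor.
$html = '<p class="x">Hi <a href="#">link</a></p><script>not allowed</script>';

$clean = strip_tags($html, $allowed);

// The <a> and <script> tags are gone, but their inner text and the
// class attribute on <p> survive.
echo $clean, "\n";
```

The remaining text of the script element and the surviving class attribute are the reasons a DOM pass or a filtering library is still needed afterwards.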

【Discussion】:

I am already doing that :) Fine, I was hoping to find answers for parts 2 and 3
@D3VELOPER - See my comment on the question for the flaw I see in your plan for part 2.

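【Solution 3】:

I have code that does exactly this, but it is undocumented and uses some code that I do not own but that is in the public domain. It is very easy to use and makes sure all tags are closed so they cannot affect the rest of your page; use the fix_html function for that. It can also restrict which tags and attributes are allowed, via strip_tags_attributes, and strip_javascript removes any kind of JavaScript functionality. I have used this extensively but, honestly, I do not know whether this copy is the one from production. As for your second point, I think it is best to strip styles entirely so that users simply use plain formatting tags such as <b>. And please do not let anyone use underline.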

function findNodeValue($parent, $node) {
    $nodes=array();
    if(!is_a($parent, "DOMElement")) return NULL;

    foreach($parent->childNodes as $child)
        if($child->nodeName==$node) $nodes[]=$child;

    if(count($nodes)==0) return NULL;
    if(count($nodes)==1) return $nodes[0]->nodeValue;
    else {
        $ret=array();
        foreach($nodes as $node)
            $ret[]=$node->nodeValue;

        return $ret;
    }
}

function strip_javascript($filter){ 

    // realign javascript href to onclick 
    $filter = preg_replace("/href=(['\"]).*?javascript:(.*)?\\1/i", "onclick=' $2 '", $filter);

    //remove javascript from tags 
    while( preg_match("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // dump expressions from contributed content 
    $filter = preg_replace("/:expression\(.*?((?>[^(.*?)]+)|(?R)).*?\)\)/i", "", $filter); 
    $filter = preg_replace("/<iframe.*?>/", "", $filter);
    $filter = preg_replace("/<\/iframe>/", "", $filter);

    while( preg_match("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // remove all on* events    
    while( preg_match("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", $filter, $match) ) {
        $filter = preg_replace("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", "<$1$3>", $filter); 
    }

    return $filter; 
}

function html2a ( $html ) {
  ini_set('pcre.backtrack_limit', 10000);
  ini_set('pcre.recursion_limit', 10000);

  if ( !preg_match_all( '@\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>((?:(?>[^\<]*)|(?R))*)\<\/\s*?\\1(?:\b[^\>]*)?\>|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
    return $html;
  $i = 0;
  $ret = array();
  foreach ($m as $set) {
    if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
      $ret[] = $val;
    $val = $set[1][1] < 0 
      ? array( 'tag' => strtolower($set[4][0]) )
      : array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
    if ( preg_match_all( '/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix', isset($set[5]) && $set[2][1] < 0 ? $set[5][0] : $set[2][0],$attrs, PREG_SET_ORDER ) ) {
      foreach ($attrs as $a) {
        $val['attr'][$a[1]]=$a[count($a)-1];
      }
    }
    $ret[] = $val;
    $i = $set[0][1]+strlen( $set[0][0] );
  }
  $l = strlen($html);
  if ( $i < $l )
    if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
      $ret[] = $val;
  return $ret;
}

function a2html ( $a, $in = "" ) {
  if ( is_array($a) ) {
    $s = "";
    foreach ($a as $t)
      if ( is_array($t) ) {
        $attrs=""; 
        if ( isset($t['attr']) )
          foreach( $t['attr'] as $k => $v )
            $attrs.=" {$k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" );
        $s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in).$in."</".$t['tag'] : "/" ).">";
      } else
        $s.= $in.$t."";
  } else {
    $s = empty($a) ? "" : $in.$a."";
  }
  return $s;
}

function remove_unclosed(&$a, $allowunclosed) {
    if(!is_array($a)) return;

    foreach($a as $k=>$tag) {
        if(is_array($tag)) {
            if(!isset($tag["val"]) && !in_array($tag["tag"],$allowunclosed)) {
                //var_dump($tag["tag"]);
                unset($a[$k]);
            } elseif(is_array(@$tag["val"]))
                remove_unclosed($a[$k]["val"], $allowunclosed);
        }
    }
}

function fix_html($html, $allowunclosed=array("br")) {
    $a = html2a($html);
    remove_unclosed($a, $allowunclosed);
    return a2html($a);
}

function strip_tags_ex($str,$allowtags) { 
    $strs=explode('<',$str); 
    $res=$strs[0]; 
    for($i=1;$i<count($strs);$i++) 
    { 
        if(!strpos($strs[$i],'>')) 
            $res = $res.'&lt;'.$strs[$i]; 
        else 
            $res = $res.'<'.$strs[$i]; 
    } 
    return strip_tags($res,$allowtags);    
}

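// NOTE: allowedtags and allowedattributes are constants that are not defined
// in this snippet; define() them (or pass both arguments explicitly) before use.
function strip_tags_attributes($string,$allowtags=allowedtags,$allowattributes=allowedattributes){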
    $string=strip_javascript($string);

    $string = strip_tags_ex($string,$allowtags); 

    if (!is_null($allowattributes)) { 
        if(!is_array($allowattributes)) 
            $allowattributes = explode(",",$allowattributes); 
        if(is_array($allowattributes)) 
            $allowattributes = implode(")(?<!",$allowattributes); 
        if (strlen($allowattributes) > 0) 
            $allowattributes = "(?<!".$allowattributes.")"; 
        $string = preg_replace_callback("/<[^>]*>/i", function ($matches) use ($allowattributes) {
            // create_function() was removed in PHP 8.0; a closure does the same job
            return preg_replace("/ [^ =]*".$allowattributes."=(\"[^\"]*\"|'[^']*')/i", "", $matches[0]);
        }, $string);
    } 
    return $string; 
}

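I found the source of strip_javascript: http://www.php.net/manual/en/function.strip-tags.php#89453. I do not know why the attribution was no longer in the code; probably because there was no name, no e-mail and no identity to refer to.

【Discussion】:

【Solution 4】: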

// Tag whitelist from the question (blockquote and br, fixing the "blockquotem br" typo).
$allowedTags = array( 'div' => true, 'p' => true, 'span' => true, 'b' => true,
    'ul' => true, 'ol' => true, 'li' => true, 'blockquote' => true, 'br' => true );

$allowedStyles = array( 'font-weight: bold' => true, 'font-style: italic' => true, 'text-decoration: underline' => true );

$allowedAttribs = array( 'align' => true );

$doc = new DOMDocument();
$doc->loadXML( '<doc><p style="font-weight: bold">test</p> <b align="left">asdfasd faksd</b><script>asdfasd</script></doc>' );

sanitizeNodeChildren( $doc->documentElement );

echo htmlentities( $doc->saveXml() );

function sanitizeNodeChildren( $parentNode ) {
    $node = $parentNode->firstChild;
    while( $node ) {
        if( !sanitizeNode( $node ) ) {
            $nodeToDelete = $node;
            $node = $node->nextSibling;
            $parentNode->removeChild( $nodeToDelete );
        } else {
            sanitizeNodeChildren( $node );
            $node = $node->nextSibling;
        }
    }
}

function sanitizeNode( $node ) {
    global $allowedTags, $allowedStyles, $allowedAttribs;
    if( $node->nodeType == XML_ELEMENT_NODE ) {
        if( !isset( $allowedTags[ $node->tagName ] ) ) return false;

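        // Copy the attribute list first: removing attributes while iterating
        // the live DOMNamedNodeMap can skip entries.
        foreach( iterator_to_array( $node->attributes ) as $name => $attrNode ) {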
            if( $name == 'style' ) {
                if( isset( $allowedStyles[ $attrNode->nodeValue ] ) ) continue;
            }
            if( isset( $allowedAttribs[ $name ] ) ) continue;
            $node->removeAttribute( $name );
        }
    }

    return true;
}
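The code above compares the whole style attribute against exact strings, so a value such as font-weight:bold (no space) or one that combines several declarations is treated as not allowed. As a rough sketch of one way around that, the helper below filters a style value declaration by declaration. The function name filterStyle and the $allowedDeclarations array are hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: keep only whitelisted "property: value" declarations
// from a style attribute, ignoring case and extra whitespace.
function filterStyle(string $style, array $allowed): string
{
    $kept = [];
    foreach (explode(';', $style) as $declaration) {
        // Split "property: value" at the first colon only.
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // empty or malformed declaration
        }
        $property = strtolower(trim($parts[0]));
        $value = strtolower(trim($parts[1]));
        if (isset($allowed[$property]) && in_array($value, $allowed[$property], true)) {
            $kept[] = $property . ': ' . $value;
        }
    }
    return implode('; ', $kept);
}

// The three declarations the question wants to allow.
$allowedDeclarations = [
    'font-weight' => ['bold'],
    'font-style' => ['italic'],
    'text-decoration' => ['underline'],
];

$result = filterStyle('font-weight:bold; color: red; FONT-STYLE: Italic;', $allowedDeclarations);
echo $result, "\n"; // font-weight: bold; font-style: italic
```

Inside an attribute loop like the one in sanitizeNode, the filtered string could be written back with setAttribute('style', ...), or the attribute removed when the result is empty.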

【Discussion】: