Filtering unrecognized tags using php [duplicate]

[Question title]: Filtering unrecognized tags using php [duplicate]
[Posted]: 2013-11-18 14:55:46
[Question]:

I need to retrieve the untagged content from a string. This is what the input looks like:

<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
and this is normal text

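Given a list of recognized tags, I need a nice, simple way to strip out the "recognized" tags and the normal text, so that I am left with only the unrecognized material.

This is how I do it at the moment, but as you can see, I use two regular expressions. I would like it to be just one.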

$recognized_tags    = implode( '|', array( 'recognized', 'foo', 'bar' ) );
$pattern            = '/<!--\[(?<tag>(' . $recognized_tags . '))\]-->(?<tag_content>.*)<!--\[\/\k<tag>\]-->/s';
$partial_result     = preg_replace( $pattern, '', $text );

preg_match_all( '/<!--\[(?<tag>.+)\]-->(?<tag_content>.*)<!--\[\/\k<tag>\]-->/s', $partial_result, $matches );
$result = implode( $matches[0] );

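So, do you know how I could do this with a single regular expression? Note that the input string may vary and can contain multiple tags (recognized or not).

Thanks a lot!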

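[Question discussion]:

Don't try to use regular expressions for this.
Are there self-closing tags?
Can tags be nested?
I'll take a look at the HTML whitelist stuff. Yes, there are self-closing tags (tags with a /). Yes, tags can be nested :(
What do the self-closing tags look like? Like this: ?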

Tags: php regex tags

[Solution 1]:

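EDIT: finding the content of unrecognized tags: (coming soon)

Old answer: to find the text that is not enclosed between tags, you can apply this pattern to the original $text string (no prior replacement is needed):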
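
For the case the asker describes (content of tags that are not in the list), one possible single-pattern approach is a negative lookahead on the tag name. This is only a sketch and not the edit promised above. It assumes tags are neither nested nor self-closing, and it reuses the asker's list of names and the group names tag and tag_content.

```php
<?php
$text = <<<'LOD'
<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
and this is normal text
LOD;

// Recognized names, quoted so they are safe inside the pattern.
$recognized = implode('|', array_map(
    function ($name) { return preg_quote($name, '~'); },
    array('recognized', 'foo', 'bar')
));

// (?!(?:...)\]) rejects an opening tag whose full name is in the list.
// [^]/]++ keeps closing tags such as [/foo] from being read as openers.
// Assumption: no nesting, so a lazy .*? up to the matching closer is enough.
$pattern = '~<!--\[(?!(?:' . $recognized . ')\])(?<tag>[^]/]++)\]-->'
         . '(?<tag_content>.*?)<!--\[/\k<tag>\]-->~s';

preg_match_all($pattern, $text, $matches);

print_r($matches['tag_content']); // one entry: "This is an unrecognized tag"
```

Nested tags would need the recursive approach used in the old answer instead of the lazy quantifier.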

$text = <<<'LOD'
<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
<!--[atag]-->
    <!--[nested1]--> text
        <!--[nested2]-->text<!--[/nested2]-->
    <!--[/nested1]-->
<!--[/atag]-->
and this is normal text
LOD;

$pattern = '~(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->)*+\K[^<]++~';
preg_match_all($pattern, $text, $matches);

print_r($matches[0]);

Pattern details:

~                       # delimiter
(                       # capturing group 1: will capture all tags with content inside
    <!--\[([^]]++)]-->  # the opening tag: the capturing group 2 contains the name of the tag
    (?>                 # atomic group: all possible content inside tags 
        [^<]++          # all characters except <
      |                 # OR
        (?1)            # another tag: recursion into capturing group 1
    )*+                 # repeat zero or more times the atomic group
    <!--\[/\2]-->       # the closing tag with a backreference to the 2nd capturing group
)*+                     # repeat zero or more times the capturing group 1
\K                      # IMPORTANT: \K discards everything matched so far from the reported match
[^<]++                  # the result: all characters that are not a <
~

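The general idea of this pattern is to match all potential tags that come before the "normal text", and then use the \K feature to drop that part from the match result.

Note: to avoid blank results and to trim leading whitespace, you can extend the pattern like this: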
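
The effect of \K can be seen in isolation with a much smaller pattern. The input string and the tag name a below are made up for illustration.

```php
<?php
// Hypothetical sample: one complete tag pair followed by plain text.
$sample = '<!--[a]-->inside<!--[/a]-->outside';

// The tag pair before \K has to match, but it is left out of $m[0].
preg_match('~<!--\[a]-->[^<]*+<!--\[/a]-->\K[^<]++~', $sample, $m);

echo $m[0], PHP_EOL; // outside
```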

$pattern = '~(?>\s++|(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->))*+\K[^<]++~';
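
To see what the extra \s++ alternative changes, both patterns can be run on the same input. The short input and the variable names $plain and $trimmed are assumptions chosen to keep the output readable.

```php
<?php
// Hypothetical input: two tag pairs on their own lines, then plain text.
$text = "<!--[a]-->x<!--[/a]-->\n<!--[b]-->y<!--[/b]-->\nplain text";

// Original pattern: a newline between two tags stops the tag loop.
$p1 = '~(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->)*+\K[^<]++~';
// Extended pattern: whitespace between tags is consumed before \K.
$p2 = '~(?>\s++|(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->))*+\K[^<]++~';

preg_match_all($p1, $text, $plain);
preg_match_all($p2, $text, $trimmed);

// json_encode makes the newlines visible.
echo json_encode($plain[0]), PHP_EOL;   // ["\n","\nplain text"]
echo json_encode($trimmed[0]), PHP_EOL; // ["plain text"]
```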

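[Discussion]:

Wow! Could you explain it?
@googol: Is everything clear?
Yes, the regex is clear and I learned a lot! But it doesn't do what I need :( Maybe my question wasn't clear... what I really need is to get the content from the "unrecognized" tags (the ones that are not in my list).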