Regex to split HTML tags: answers

【Question title】: Regex to split HTML tags
【Posted】: 2011-05-01 23:47:12
【Question description】:

I have an HTML string like this:

<img src="http://foo"><img src="http://bar">

What regular expression pattern would split it into two separate img tags?

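【Comments on the question】:

They are already 2 separate tags.
It is already two separate img tags.
Please search for similar questions. There are many. Unless your input is very small, specific, and regular, never use a regex on HTML.
Not every computing problem is best solved with a regular expression.
The literal answer to your question is split /(?<=>)(?=<)/, but if that really is the answer you are looking for, I can almost guarantee you are doing something very wrong.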
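
The literal split from the last comment can be tried directly. This is a minimal sketch in Python, picked here only for illustration because the asker never named a language. It assumes Python 3.7 or later, where re.split accepts a pattern that can match the empty string; the variable names are made up for the example.

```python
import re

html = '<img src="http://foo"><img src="http://bar">'

# Split at every zero-width position that is preceded by '>' and followed by '<'.
# Nothing is consumed, so both angle brackets stay attached to their tags.
tags = re.split(r'(?<=>)(?=<)', html)

for tag in tags:
    print(tag)
```

It only works while the tags sit directly next to each other. Whitespace between tags, or the sequence >< inside an attribute value, would give wrong pieces, which is the commenter's point.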

Tags: regex

【Solution 1】:

Don't do it with regex. Use an HTML/XML parser. You can even run the input through Tidy first to clean it up. Most languages have a Tidy library. What language are you using?
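
As one possible illustration of the parser route, here is a sketch using html.parser from the Python standard library. Python is an assumption, since the question does not say which language is in use, and the ImgCollector class and its attribute names are hypothetical.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the raw text and the src value of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []      # raw tag text exactly as it appeared in the input
        self.sources = []   # src attribute of each tag (None if it has none)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # get_starttag_text() returns the source text of the tag being handled
            self.tags.append(self.get_starttag_text())
            self.sources.append(dict(attrs).get("src"))


collector = ImgCollector()
collector.feed('<img src="http://foo"><img src="http://bar">')
collector.close()

print(collector.tags)
print(collector.sources)
```

The parser does the tokenizing, so the code never has to decide where one tag ends and the next begins.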

【Discussion】:

【Solution 2】:

This will do it:

<img\s+src=\"[^\"]*?\">

Or you can do this to allow for any other attributes:

<img\s+[^>]*?\bsrc=\"[^\"]*?\"[^>]*>
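
To see how the two patterns differ, here is a small Python sketch (the language and the sample input with an extra alt attribute are assumptions made for the demonstration). The backslashes before the quotes are dropped because they are only needed when the pattern sits inside a double-quoted string literal.

```python
import re

# First pattern: src must be the only attribute.
simple = re.compile(r'<img\s+src="[^"]*?">')
# Second pattern: other attributes may appear before or after src.
flexible = re.compile(r'<img\s+[^>]*?\bsrc="[^"]*?"[^>]*>')

html = '<img src="http://foo"><img alt="x" src="http://bar">'

simple_hits = simple.findall(html)      # misses the tag that has an alt attribute
flexible_hits = flexible.findall(html)  # finds both tags

print(simple_hits)
print(flexible_hits)
```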

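【Discussion】:

This does not account for what you call "additional attributes". See my solution for how to do it correctly. Well, as correctly as is possible without using an HTML parsing class.
I was actually looking for a quick and dirty way to get the src attribute values of all img tags in a string and came across this answer, which was very helpful. In my case I only had to add a pair of parentheses: <img\s+[^>]*?\bsrc=\"([^\"]*?)\"[^>]*>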
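
The parenthesized variant from the comment above can be sketched in Python like this (Python and the sample input are assumptions; IMG_SRC is a hypothetical name). With exactly one capture group, re.findall returns the captured src values rather than the whole tags.

```python
import re

# Same pattern as in the comment, with a capture group around the src value.
IMG_SRC = re.compile(r'<img\s+[^>]*?\bsrc="([^"]*?)"[^>]*>')

html = '<img src="http://foo"><img class="x" src="http://bar" alt="">'

sources = IMG_SRC.findall(html)
print(sources)
```

This stays quick and dirty: an attribute such as data-src placed before src would be captured instead, because \b also matches right after the hyphen.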

【Solution 3】:

<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">

PHP example:

$prom = '<img src="http://foo"><img src="http://bar">';

preg_match_all('|<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">|',$prom, $matches);

print_r($matches[0]);

【Discussion】:

【Solution 4】:

How sure are you that your string is exactly that? What about input like this:

<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >

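What programming language is this? Is there some reason you are not using a standard HTML parsing class to handle this? Regexes are a good approach only when you have a very well-known set of inputs. They do not work on real HTML, only on rigged demos.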
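
The two sample tags above can be used to check this claim. The sketch below is in Python, which is an assumption because the thread never settles on a language; the names naive and SrcCollector are hypothetical. It runs a simple src-extracting pattern and the standard library's html.parser over the same input.

```python
import re
from html.parser import HTMLParser

tricky = '''<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >'''

# A simple pattern: it stops at the first '>' and only accepts double quotes.
naive = re.compile(r'<img\s+[^>]*?\bsrc="([^"]*?)"[^>]*>')
regex_sources = naive.findall(tricky)


class SrcCollector(HTMLParser):
    """Records the src value of every <img> tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(value for name, value in attrs if name == "src")


parser = SrcCollector()
parser.feed(tricky)
parser.close()
parser_sources = parser.sources

print(regex_sources)   # the simple pattern finds nothing here
print(parser_sources)  # the parser understands the quoting
```

The pattern fails on the first tag because of the quoted > and on the second because of the single quotes, while the parser handles both.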

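Even if you must use a regex, you should use a proper grammatical one. It is easy. I have tested the following program on zillions of web pages. It handles the cases I outlined above, and one or two others as well.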

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

$/ = undef;
$_ = <>;   # read all input

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ <script> .*?  </script> }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

my $count = 0;

while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
}

There you go. Nothing to it!

Gee, why would you ever want to use an HTML parsing class, given how easy it is to deal with HTML in a regex? ☺
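
For readers who do not use Perl, the central idea of the grammar above, a quoted value that runs until the same quote character appears again, can be sketched much more briefly in Python. This is only a rough analogue written for illustration, not a port of the program: it does not check attribute names against any list, it does not strip comments, CDATA, or scripts first, and the name IMG_TAG is hypothetical.

```python
import re

IMG_TAG = re.compile(r"""
    < \s* img \b                      # start of the tag
    (?:
        \s+ [\w:-]+                   # attribute name
        (?:
            \s* = \s*
            (?:
                (?P<quote>["'])       # opening quote, remembered by name
                (?:(?!(?P=quote)).)*  # anything except that same quote
                (?P=quote)            # matching closing quote
              | [^\s>]+               # or an unquoted value
            )
        )?
    )*
    \s* /? >                          # HTML or XHTML style end of tag
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

sample = '''<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >'''

found = [m.group(0) for m in IMG_TAG.finditer(sample)]
for tag in found:
    print(tag)
```

finditer is used instead of findall because the pattern contains a capture group, and findall would return only the captured quote characters.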

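【Discussion】:

【Solution 5】:

A slightly insane/clever/weird way to do it is to split on >< and then add the missing bracket back to each piece: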

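$string = '<img src="http://foo"><img src="http://bar">';
// explode() is used because split() was deprecated in PHP 5.3 and removed in PHP 7.0
$KimKardashian = explode("><", $string);
// put back the bracket that the delimiter took away from each piece
$First = $KimKardashian[0] . '>';
$Second = '<' . $KimKardashian[1];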
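
The same trick can be generalized to any number of adjacent tags. Below is a sketch in Python (an assumption, since the answer itself is PHP; the function name split_adjacent_tags is hypothetical). Each piece gets back whichever bracket the delimiter consumed.

```python
def split_adjacent_tags(html):
    """Split on '><' and restore the brackets that the delimiter consumed."""
    parts = html.split("><")
    last = len(parts) - 1
    return [
        ("<" if i > 0 else "") + part + (">" if i < last else "")
        for i, part in enumerate(parts)
    ]


two = split_adjacent_tags('<img src="http://foo"><img src="http://bar">')
three = split_adjacent_tags('<img src="a"><img src="b"><img src="c">')

print(two)
print(three)
```

Like the original, it assumes the tags touch each other with nothing in between and that >< never occurs inside an attribute value.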

【Discussion】: