Could anyone explain this regex

【Question title】: Could anyone explain this regex
【Posted】: 2012-10-04 07:31:16
【Question description】:

I just need someone to correct my understanding of this regex, which is meant as a stopgap for matching HTML tags.

< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >

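My understanding:

< - matches the tag-opening symbol
(?: - I can't understand what is happening here. What do these symbols mean?
"[^"]*" ['"]* - any string in double quotes. Is anything else going on here?
'[^']*'['"]* - some string in single quotes
[^'">] - any character other than ' " >

So it is a '<', followed by a double-quoted string, a single-quoted string, or any character other than ' " >, repeated one or more times, followed by a '>'.
That is the best I could make of it.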

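【Question comments】:

I think your understanding looks fine. But as with all regular expressions, you should get yourself a regex tester and check a few scenarios to be sure (I used a Firefox plugin that does the job).

Tags: regex html-parsing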

【Solution 1】:

<       # literally just an opening angle bracket followed by a space
(       # the parenthesis opens a subpattern; it is necessary as a boundary for
        # the | later on
?:      # makes the just-opened subpattern non-capturing (so you can't access it
        # as a separate match later)
"       # literally "
[^"]    # any character but " (this is called a character class)
*       # arbitrarily many of those (as many as possible)
"       # literally "
['"]    # either ' or "
*       # arbitrarily many of those (possibly alternating! it doesn't have
        # to be the same character for the whole string)
|       # OR
'       # literally '
[^']    # any character but ' (this is called a character class)
*       # arbitrarily many of those (as many as possible)
'       # literally '
['"]*   # as above
|       # OR
[^'">]  # any character but ', ", >
)       # closes the subpattern
+       # arbitrarily many repetitions but at least once
>       # literally > (preceded by a space in the original pattern)

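Note that every space in the regex is treated like any other character: each one matches exactly one space (unless the pattern is compiled in free-spacing/extended mode, where whitespace outside character classes is ignored).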
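The effect of those literal spaces can be checked with a small experiment. The sketch below uses Python's `re` module, which is an assumption on my part since the thread does not name a regex engine; the variable names are made up for illustration. It compiles the pattern once as posted and once with `re.VERBOSE`, where whitespace outside character classes is ignored.

```python
import re

# The pattern exactly as posted in the question, spaces included.
PATTERN = r'''< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >'''

# Default mode: every space in the pattern must match a space in the input.
literal = re.compile(PATTERN)

# Free-spacing mode: whitespace outside character classes is ignored.
free_spacing = re.compile(PATTERN, re.VERBOSE)

ordinary_tag = '<a href="x">'
spaced_tag = '< a >'

print(literal.search(ordinary_tag))       # None: no space after '<'
print(free_spacing.search(ordinary_tag))  # matches the whole tag
print(literal.search(spaced_tag))         # matches '< a >'
```

Under these assumptions the pattern as posted only matches input that really has a space after `<` and before `>`, which suggests it was written with free-spacing mode in mind.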

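Also pay particular attention to the ^ at the start of a character class. It is not treated as a character of its own; instead it negates the whole character class.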
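A minimal illustration of that negation, again assuming Python's `re` module (the sample string is invented): a `^` in first position inverts the class, while a `^` anywhere else is just a literal caret.

```python
import re

text = 'a"^b'

# ^ in first position negates the class: every character that is not "
negated = re.findall(r'[^"]', text)

# ^ in any other position is an ordinary member of the class: " or ^
literal_caret = re.findall(r'["^]', text)

print(negated)        # ['a', '^', 'b']
print(literal_caret)  # ['"', '^']
```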

I might also (obligatorily) add that regular expressions are not appropriate to parse HTML.
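For comparison, here is a rough sketch of what using a real parser could look like. It assumes Python and its standard-library `html.parser` module, neither of which is mentioned in the thread; `TagCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
# A '>' inside a quoted attribute value is handled by the parser.
collector.feed('<a title="1 > 0" href=\'x\'>link</a>')
print(collector.tags)  # [('a', {'title': '1 > 0', 'href': 'x'})]
```

The parser takes care of quoting rules, so no hand-written pattern for attribute values is needed.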

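【Comments】:

Thanks for the great answer. Non-capturing subpatterns... off to Google them.
That is probably a good idea. It is a very powerful concept when you want to extract data from a larger structure, or when you need to replace such structures but keep the data inside them (using regex).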
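To make the difference between capturing and non-capturing groups concrete, here is a small sketch assuming Python's `re` module; the `name=value` input is invented for illustration.

```python
import re

capturing = re.compile(r'(\w+)=(\w+)')        # two capturing groups
non_capturing = re.compile(r'(?:\w+)=(\w+)')  # first group only groups

both = capturing.search('width=100').groups()
value_only = non_capturing.search('width=100').groups()

# Captured groups can be reused in a replacement via backreferences.
swapped = capturing.sub(r'\2=\1', 'width=100')

print(both)        # ('width', '100')
print(value_only)  # ('100',)
print(swapped)     # 100=width
```

A `(?:...)` group still delimits the alternation or repetition it wraps; it simply does not get a group number, so nothing can be extracted from it afterwards.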
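One more thing I can't figure out... the pattern "[^"]*" ['"]* should match "some random stuff here", but why is there a ['"]* at the end? Does the * apply to the whole expression or only to the character set ['"]?
It applies only to the character class ['"]. I am not quite sure what its purpose is. Note that the third part of the alternation (the part after the second |) excludes ' and ", so the trailing class may be there to let stray quote characters right after a quoted value through. Also note that this regex does not match self-closing tags, because they have no space in front of the closing >.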
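Both points can be probed with a quick experiment. This is only a sketch assuming Python's `re` module, with the literal spaces removed from the pattern for readability; `WITH_TRAILING` and `WITHOUT_TRAILING` are names made up here. Because the third alternative `[^'">]` excludes both quote characters, dropping the trailing `['"]*` changes what is accepted.

```python
import re

QUOTED = r'''"[^"]*"['"]*'''

# The * binds only to ['"], not to the whole quoted string before it.
star_on_class = re.fullmatch(QUOTED, '"abc"\'"')    # quoted string + stray ' "
star_on_all = re.fullmatch(QUOTED, '"abc""abc"')    # would need (...)* to match

WITH_TRAILING = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
WITHOUT_TRAILING = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^'">])+>''')

clean = '<a href="x">'
stray = '<a href="x"">'   # one extra quote after the attribute value

print(bool(star_on_class), bool(star_on_all))                          # True False
print(bool(WITH_TRAILING.search(clean)), bool(WITHOUT_TRAILING.search(clean)))  # True True
print(bool(WITH_TRAILING.search(stray)), bool(WITHOUT_TRAILING.search(stray)))  # True False
```

In this experiment the trailing class only makes a difference for malformed input with extra quote characters directly after a closed quoted value; whether that was the author's intent is a guess.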
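I found this regex in one of the answers to the post you linked in your answer above. Since I have only just started with regex, I needed some clarification on it. As we are both stumped, I guess it must be there to handle some obscure corner case we can't think of.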

【Solution 2】:

Split it at the |s, which mean "or":

<
  (?:
    "[^"]*" ['"]* |
    '[^']*'['"]* |
    [^'">]
  )+
>

(?: denotes a non-capturing group. The inside of that group matches one of these things (tried in this order):

"stuff"
'stuff'
asd=

In effect, this is a regex that attempts to match an HTML tag together with its attributes.
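The three alternatives can be watched at work by applying just the inner alternation to the inside of a tag. This is a sketch assuming Python's `re` module, with the literal spaces dropped from the pattern; the sample input is invented.

```python
import re

# Only the alternation from inside the group, without < ... > around it.
ALTERNATION = re.compile(r'''"[^"]*"['"]*|'[^']*'['"]*|[^'">]''')

inside = 'b="1"c=\'2\''
tokens = ALTERNATION.findall(inside)

# Quoted values come out as single tokens; everything else is matched
# one character at a time by the third alternative.
print(tokens)  # ['b', '=', '"1"', 'c', '=', "'2'"]
```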

【Comments】:

【Solution 3】:

This is the output of YAPE::Regex::Explain

(?-imsx:< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <                        '< '
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
     "                       ' "'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '" '
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
     '                       ' \''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^'">]                   any character except: ''', '"', '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
   >                       ' >'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

【Comments】: