重复捕获组与使用嵌套模式捕获重复组答案

【问题标题】：Repeating a capturing group vs capturing a repeated group with nested pattern重复捕获组与使用嵌套模式捕获重复组
【发布时间】：2015-04-13 13:01:31
【问题描述】：

我正在尝试使用正则表达式并遇到以下问题。

说，我有以batman 开头和结尾的行，中间有一些任意数字，我想要捕获组中的数字以及单词batman。

batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman

这很容易实现（简单的一个 => (\s*batman (\d+) batman\s*) DEMO）。

现在我尝试了更多.. 将相同的数据放入 capture tag (#capture)

#capture
batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman
#capture

#others
batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman
#others

我试图只捕获#capture 之间的行，我尝试过

(?:#capture)(\s*batman (\d+) batman\s*)*(?:#capture)

匹配模式但仅包含捕获组中的最后一次迭代，即$1=>batman $2=>1311 $1=>batmanDEMO

我还尝试使用

捕获重复组

(?:#capture)((\s*batman (\d+) batman\s*)*)(?:#capture)

这个捕获了所有内容.. 但在不同的组中.. DEMO

有人可以帮我理解和解决这个问题吗？

预期结果：仅捕获#capture 中的组和组中的所有数字，以便轻松替换。

谢谢。

【问题讨论】：

你说你想要任何语言，但你使用了只有 C♯ 支持的东西。这不是命名/执行捕获的标准方式。
抱歉......我对逻辑更感兴趣......我将问题更新为特定于语言的问题。
哦，我明白了。请分别提供原始输入字符串和所需的输出结果，因为仍然不清楚您想要什么。您将无法在一场比赛中仅捕获所有数字，因为它们是不连续的。您将需要更多的程序逻辑来解决这个问题。如果您涉及换行符，您还必须使用(?s) 模式。你可以做通行证，一个得到/#capture((?s:(?!#capture).)*)#capture/，然后另一个得到第一个得到的所有/\b(\d+)\b/匹配。如果您需要更多 batman 约束，那么您可以添加这些。
@tchrist 这是一个很好的解决方法.. 在应用程序中使用时我一定会牢记这一点。

标签： php regex

【解决方案1】：

您可以在 .NET 正则表达式风格中利用非固定宽度的look-behind，并使用此正则表达式：

(?s)(?<=#capture.*?)(?:batman (\d+) batman)(?=.*?#capture)

但是，此示例适用于您提供的情况（例如，如果文本中有更多 #capture...#capture 块，它将不起作用），您只需要根据标签上下文添加更多限制。

在 PCRE/Perl 中，您可以通过声明我们要跳过的内容来获得类似的结果：

(?(DEFINE)                          # Definitions
    (?<skip>\#others.*?\#others)    # What we should skip
)
(?&skip)(*SKIP)(*FAIL)              # Skip it
|
(?<needle>batman\s+(\d+)\s+batman)  # Match it

例如，替换为batman new-$3 batman。

看到这个demo on regex101。

【讨论】：

我们可以用简单的正则表达式模式达到同样的效果吗..？
@karthikmanchala 为什么要这些数字？您是想通过替换/更改来更改数据，还是只想将它们全部拉出并用它们做一些不影响原始数据的其他事情？现在我认为你的问题是线性处理。在Perl中使用while (<FH>) { ... }风格的逐行处理，就是if (/#capture/ ... /#capture/) { push @numbers, /^batman\s+(\d+)\s+batman$/; }。
@tchrist 一旦我在不同的组中捕获数字..我应该能够做到这两点..不是吗？如果是的话，我对两者都感兴趣..
@karthikmanchala 不，这不能在单一模式中完成，因为 PCRE 不支持单一模式中的不连续匹配。您需要辅助逻辑来存储不连续的匹配和逐行处理，更不用说栅栏匹配了。即使使用一种模式可以做到这一点，您几乎肯定不应该这样做，因为这将是一场噩梦。使用简单的模式，不止一种，你会更快乐。

【解决方案2】：

由于 PCRE 无法像 .net 框架或 Python 的新正则表达式模块那样存储重复捕获，因此可以使用\G 功能并在之后检查以确保到达块的末尾。

\G 锚标记上一场比赛结束时的位置，并在全球研究环境中使用（与preg_match_all 或preg_replace* 一起使用）。找到连续的结果很有用。请注意，直到第一个匹配 \G 默认标记字符串的开头。因此，为了防止\G 在字符串开头成功，您需要添加负前瞻(?!\A)。

$pattern = '~
(?:        # two possible branches
    \G(?!\A)       # the contiguous branch
  |
    [#]capture \R  # the start branch: only used for the first match
)
(batman \h+ ([0-9]+) \h+ batman)
\R    # alias for any kind of newlines 
(?: ([#]) (?=capture) )?  # the capture group 3 is used as a flag
                          # to know if the end has been reached.
                          # Note that # is not in the lookahead to
                          # avoid the start branch to succeed
~x';

if (preg_match_all($pattern, $text, $matches) && array_pop($matches[3])) {
    print_r($matches[1]);
}

【讨论】：