How exactly does this recursive regex work? (Answers)

【Question title】: How exactly does this recursive regex work?
【Posted】: 2017-05-10 10:19:27
【Question】:

Consider this pattern:

(o(?1)?o)

It matches any sequence of o of length 2ⁿ, with n ≥ 1.
It works, see regex101.com (word boundaries added for a better demonstration).
The question is: why?

In the following, a string (matching or not) is described simply by a number or by a term for its length, such as 2ⁿ.

Broken down (with added whitespace):

( o (?1)? o )
(           ) # Capture group 1
  o       o   # Matches an o each at the start and the end of the group
              # -> the pattern matches from the outside to the inside.
    (?1)?     # Again the regex of group 1, or nothing.
              # -> Again one 'o' at the start and one at the end. Or nothing.

I don't see why this matches 2ⁿ rather than 2n, because I would describe the pattern as an undefined number of "o o" pairs, nested inside each other.

Visualization:

No recursion, 2 is a match:

oo

One recursion, 4 is a match:

o  o
 oo

So far, so easy.

Two recursions. Obviously wrong, because the pattern does not match 6:

o    o
 o  o
  oo

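But why? It seems to fit the pattern.

I conclude that it is not just the plain pattern that gets repeated, because otherwise 6 would have to match.

But according to regular-expressions.info:

(?P<name>[abc])(?1)(?P>name) matches three letters, just like (?P<name>[abc])[abc][abc] does.

and

([abc])(?1){3} [...] is equivalent to ([abc])[abc]{3}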

So it seems to simply re-run the regex code of the group, without any information about the previous match of the capture group.
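The difference between re-running a group's pattern and re-matching its captured text can be checked with a small sketch. Python's built-in re module has no (?1) subroutine call, so this sketch assumes the expanded form ([abc])[abc]{3} that the quotation above gives as equivalent, and contrasts it with the backreference \1. The variable names are only illustrative.

```python
import re

# Expanded form quoted above as equivalent to ([abc])(?1){3}:
# the group's *pattern* is applied again, so any of a, b, c is accepted.
subroutine_like = re.compile(r"([abc])[abc]{3}")

# A backreference instead repeats the *text* captured by group 1.
backreference = re.compile(r"([abc])\1{3}")

for text in ("abca", "aaaa"):
    print(text,
          bool(subroutine_like.fullmatch(text)),
          bool(backreference.fullmatch(text)))
```

Only the backreference version rejects "abca", which is the sense in which a subroutine call carries no information about the earlier capture.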

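Can someone explain why this pattern matches 2ⁿ and nothing else?

Edit:

It was mentioned in the comments:

I doubt that referencing a capture group inside of itself is actually a supported case.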

regular-expressions.info does mention the technique:

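If you place a call inside the group that it calls, you'll have a recursive capturing group.

【Comments】:

You understood recursion correctly. The word boundaries are what confuse you here. Look here, 6 os are matched fine.
That's interesting. You're right, that confused me. With respect to word boundaries, what is the difference between 6, 8, 12 and 16? I'll edit the question later.

Tags: regex recursion pcre regex-group regex-recursion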

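【Solution 1】:

You understood recursion correctly. The word boundaries are what confuse you here. The \b around the pattern requires the regex engine to match the string only if it is neither preceded nor followed by word characters.

See how the recursion goes here: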

( o      (?1)?         o )  => oo

(?1) is then replaced with (o(?1)?o):

( o   (?>o(?1)?o)?     o )  => oo or oooo

And again:

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

See the regex demo without word boundaries.

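Why are the (?>...) atomic groups added in the expansions above? Each recursion level in PHP recursive regexes is atomic, unlike in Perl: once a recursion level has finished matching, the engine will not backtrack into it to try a different match when a later part of the pattern fails.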
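To make the effect of atomic recursion concrete, here is a minimal Python sketch that models the pattern by hand. It is not PCRE and does not use a regex engine; match_group1 and find_all are hypothetical helpers, and the model assumes exactly the behaviour described above: a recursive call reports only its first successful match and is never re-entered.

```python
def match_group1(s, i=0):
    """Return the end index of a match of (o(?1)?o) starting at s[i], or None.

    Hypothetical model: the recursive call is atomic, so it reports only its
    first successful match and is never asked for another one.
    """
    if i >= len(s) or s[i] != "o":       # leading o
        return None
    inner_end = match_group1(s, i + 1)   # greedy attempt at (?1)
    if inner_end is not None and inner_end < len(s) and s[inner_end] == "o":
        return inner_end + 1             # trailing o after the recursion
    # The atomic recursion cannot be retried: let (?1)? match nothing instead.
    if i + 1 < len(s) and s[i + 1] == "o":
        return i + 2
    return None


def find_all(s):
    """Collect non-overlapping matches from left to right."""
    matches, i = [], 0
    while i < len(s):
        end = match_group1(s, i)
        if end is None:
            i += 1
        else:
            matches.append(s[i:end])
            i = end
    return matches


print(find_all("o" * 6))  # ['oooo', 'oo']
print(find_all("o" * 7))  # ['oooooo']
```

Under this model six os split into a match of four and a match of two, and seven os yield a single match of six, which agrees with the results discussed in this thread.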

When you add word boundaries, the first o and the last o that are matched cannot have any other word characters before or after them. So ooo won't match then.
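The word-boundary effect can be seen in isolation with a short sketch. Python's re module does not support the recursive (?1) call, so the literal oo is used here as a stand-in for the shortest match of (o(?1)?o); that substitution is an assumption made purely for illustration.

```python
import re

# "oo" stands in for the shortest match of (o(?1)?o).
without_boundaries = re.findall(r"oo", "ooo")
with_boundaries = re.findall(r"\boo\b", "ooo")
separate_words = re.findall(r"\boo\b", "oo ooo")

print(without_boundaries)  # ['oo']
print(with_boundaries)     # []
print(separate_words)      # ['oo']
```

In ooo the two matched os are always followed or preceded by another o, which is a word character, so \b cannot hold on both sides.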

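See Recursive Regular Expressions for a step-by-step explanation, and Word Boundary: \b at rexegg.com.

Why is oooooo not matched as a whole, but as oooo and oo?

Again, each recursion level is atomic. oooooo is matched like this: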

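(o(?1)?o) matches the first o
(?1)? gets expanded, so the pattern is now (o(?>o(?1)?o)?o), and it matches the second o in the input
This goes on until (o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o) no longer matches the input; backtracking happens and we go back to the 6th level,
The whole 6th recursion level also fails, since it cannot match the necessary number of os
This goes on until a level is reached that can match the necessary number of os.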
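The steps above can be condensed into a small recurrence. This is a sketch under an assumption that the answer does not state explicitly: with atomic recursion, the result at each level depends only on how many os remain. matched_length is a hypothetical helper name.

```python
def matched_length(n):
    """Length matched at the start of a run of n o's (0 if there is no match)."""
    length = 0                      # a single o cannot match
    for k in range(2, n + 1):
        # `length` is what the inner level matched on the k-1 o's that
        # follow the leading o; the trailing o needs one more character.
        if length and length + 2 <= k:
            length += 2             # leading o + inner match + trailing o
        else:
            length = 2              # inner level dropped: plain "oo"
    return length


print(matched_length(6), matched_length(7))                   # 4 6
print([n for n in range(1, 33) if matched_length(n) == n])    # [2, 4, 8, 16, 32]
```

The matched length grows by two per extra o until it hits the full length, then falls back to 2, so the whole run is matched only when its length is a power of two.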

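See the regex debugger:

【Comments】:

I still have a hard time understanding why 6 os are matched as 4 + 2, and 7 os are matched as 6?
@SebastianProske: Check this debugger - the first o (to the left of the recursion construct) grabs all the os in the input string. Then each trailing o has to be accommodated at each depth level. This is how the engine backtracks inside the main subpattern.
@SebastianProske: It also has to do with the fact that each recursion depth is atomic: since the first o before (?1) matches all the os in the string, there is no place left for the last o to match, as there is no more text for the last recursion level.
Thanks, I finally figured it out. I've added the steps I took as an answer - but yours definitely deserves my +1.

【Solution 2】:

This is more or less a follow-up to Wiktor's answer - even after removing the word boundaries, I had a hard time figuring out why oooooo (6) is matched as oooo and oo, while ooooooo (7) is matched as oooooo.

Here is how it works in detail:

When the recursive pattern is expanded, the inner recursions are atomic. With our pattern we can unroll it to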

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)

(In the actual pattern this gets unrolled once more, but that does not change the explanation)

And here is how the strings are matched - first oooooo (6)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)
o   |ooooo                          <- first o gets matched by first atomic group
o   o   |oooo                       <- second o accordingly
o   o   o   |ooo                    <- third o accordingly
o   o   o   o   |oo                 <- fourth o accordingly
o   o   o   o   oo|                 <- fifth/sixth o by the innermost atomic group
                     ^              <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character
o   o   o   o   xx   o   |o         <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it)
o   o   o   o   xx   o   o|         <- sixth o matches, third ag is successfully matched (thus no backtracking into it)
                           ^        <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times)
o   o                      |oo<oo   <- third and fourth o close second and first atomic group -> match returned  (4 os)

Now ooooooo (7)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)    
o   |oooooo                         <- first o gets matched by first atomic group
o   o   |ooooo                      <- second o accordingly
o   o   o   |oooo                   <- third o accordingly
o   o   o   o   |ooo                <- fourth o accordingly
o   o   o   o   oo|o                <- fifth/sixth o by the innermost atomic group
o   o   o   o   oo  o|              <- fourth ag is matched successfully (thus no backtracking into it)
                         ^          <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd
o   o   o                |ooo<o     <- 3rd ag can be closed, as well as second and first -> match returned (6 os)
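For contrast with the two traces, here is a sketch of what happens if the engine is allowed to backtrack into the recursion, which the first answer says Perl does. This is a hand-written hypothetical model (ends and first_match_length are illustrative names), not Perl's or PCRE's actual implementation: each level yields every end position it can reach, in the order a greedy backtracking engine would try them.

```python
def ends(s, i=0):
    """Yield every end index at which (o(?1)?o) can match from s[i],
    in the order a greedy backtracking engine would try them."""
    if i >= len(s) or s[i] != "o":
        return
    for inner_end in ends(s, i + 1):          # every way the recursion can match
        if inner_end < len(s) and s[inner_end] == "o":
            yield inner_end + 1
    if i + 1 < len(s) and s[i + 1] == "o":    # (?1)? matching nothing
        yield i + 2


def first_match_length(n):
    """Length of the first match found in a run of n o's (0 if none)."""
    return next(ends("o" * n), 0)


print([first_match_length(n) for n in (2, 4, 6, 7, 8)])  # [2, 4, 6, 6, 8]
```

In this non-atomic model six os are matched as a whole, which is the 2n behaviour the question expected; the atomic grouping in the traces above is what rules it out.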

【Comments】: