Python re.match takes much longer to match this

【Question title】: Python re.match takes much longer time to match this
【Posted】: 2015-10-24 06:40:40
【Question】:

I need to find input strings that match the following patterns:

'fe{10,20}.clustera1.example.com'
'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com'

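A hostname, or a {} block within a hostname, can repeat any number of times in the input string.

I first tried matching with the re module, and in some cases it takes 10-30 seconds. For example, if a space is appended to the end of the input string, like this: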

'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com '

the match takes a very long time to complete.

import re
string = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com '
print re.match('^([a-z.-]+|{[\d]+(,[\d]+)*})+(,([a-z.-]+|{[\d]+(,[\d]+)*})+)*$', string).group(0)

Even a simplified version (which does not check that the , is in the right place inside a {} block) behaves the same way.

print re.match('^([a-z.-]+|{[\d,]+})+(,([a-z.-]+|{[\d,]+})+)*$', string).group(0)

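I tried the same regular expression in Perl and with the Python regex module. Both run correctly and quickly.

Here, neither one matches (as expected), but both finish very quickly.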

echo 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com ' | \
perl -nle 'print $_ if /^([a-z.-]+|{[\d]+(,[\d]+)*})+(,([a-z.-]+|{[\d]+(,[\d]+)*})+)*$/'

import regex
string = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com '
print regex.match('^([a-z.-]+|{[\d]+(,[\d]+)*})+(,([a-z.-]+|{[\d]+(,[\d]+)*})+)*$', string).group(0)

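Is there really something wrong with the regular expression I am using? Can it be made to work with the re module itself?

The Python versions used for testing are 2.7.6 and 2.7.8.

【Question comments】:

Tags: python regex

【Solution 1】: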

Your example input string has a trailing space, but your regular expression does not allow one. So either of these:

>>> text = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com'
>>> re.match('^([a-z.-]+|{[\d,]+})+(,([a-z.-]+|{[\d,]+})+)*$', text)
    <_sre.SRE_Match object; span=(0, 69), match='fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clust>
>>> text = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com '
>>> re.match('^([a-z.-]+|{[\d,]+})+(,([a-z.-]+|{[\d,]+})+)*\s*$', text)
    <_sre.SRE_Match object; span=(0, 70), match='fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clust>

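matches quickly. With your original input, I am not sure it will ever finish: the engine searches exhaustively according to the rules of the pattern, and only after it has used up every possibility does it report that there is no match.

What exactly are the rules for a given regular expression? If you compile the regular expression with the re.DEBUG flag, you can inspect them: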
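The exhaustive search can be observed on much shorter inputs. The following is a minimal sketch, not part of the original answer, that times the simplified pattern from the question against a run of letters followed by a space, so the match is forced to fail. It is written for Python 3 (the thread used 2.7), and the helper name `time_failing_match` is hypothetical. The timings depend on the machine, but each step up in length would be expected to cost several times more than the previous one.

```python
import re
import time

# Simplified pattern from the question (braces escaped for clarity).
PATTERN = re.compile(r'^([a-z.-]+|\{[\d,]+\})+(,([a-z.-]+|\{[\d,]+\})+)*$')


def time_failing_match(n):
    """Time a match that must fail: n letters followed by a space."""
    text = 'a' * n + ' '
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Keep the lengths small: the cost grows very quickly with n.
for n in (12, 14, 16, 18):
    result, elapsed = time_failing_match(n)
    print(n, result, '%.4f s' % elapsed)
```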

>>> re.compile('^([a-z.-]+|{[\d]+(,[\d]+)*})+(,([a-z.-]+|{[\d]+(,[\d]+)*})+)*$', re.DEBUG)
at at_beginning
max_repeat 1 4294967295
  subpattern 1
    branch
      max_repeat 1 4294967295
        in
          range (97, 122)
          literal 46
          literal 45
    or
      literal 123
      max_repeat 1 4294967295
        in
          category category_digit
      max_repeat 0 4294967295
        subpattern 2
          literal 44
          max_repeat 1 4294967295
            in
              category category_digit
      literal 125
max_repeat 0 4294967295
  subpattern 3
    literal 44
    max_repeat 1 4294967295
      subpattern 4
        branch
          max_repeat 1 4294967295
            in
              range (97, 122)
              literal 46
              literal 45
        or
          literal 123
          max_repeat 1 4294967295
            in
              category category_digit
          max_repeat 0 4294967295
            subpattern 5
              literal 44
              max_repeat 1 4294967295
                in
                  category category_digit
          literal 125
at at_end
    re.compile(r'^([a-z.-]+|{[\d]+(,[\d]+)*})+(,([a-z.-]+|{[\d]+(,[\d]+)*})+)*$',
re.UNICODE|re.DEBUG)

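Where the output says literal <num>, you can look up what the number translates to in an ASCII or Unicode code point table, such as the one at asciitable.com.

As you can see, there are two huge loops here, the top-level max_repeat 1 4294967295 and the top-level max_repeat 0 4294967295, and each contains many sub-loops/searches. The regex engine searches through the permutations of these loops to try to find a match. Reasoning a little about the rules that re.DEBUG prints can help you understand what the regex engine is likely doing.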
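Instead of a lookup table, Python's built-in chr() gives the same translation. A small sketch (Python 3; the helper name `decode_literals` is hypothetical) using the code points that appear in the debug output:

```python
def decode_literals(codes):
    """Map each code point from the re.DEBUG output to its character."""
    return {code: chr(code) for code in codes}


# The literal values that appear in the output above.
for code, char in decode_literals([46, 45, 123, 44, 125]).items():
    print(code, repr(char))

# "range (97, 122)" spans chr(97) through chr(122).
print(repr(chr(97)), repr(chr(122)))
```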
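One way to shrink that search space, which nobody in this thread proposed and is offered only as a sketch, is to remove the ambiguity between the nested loops. In the original pattern a run of letters can be split between the inner [a-z.-]+ and the outer (...)+ in many different ways. Dropping the inner + accepts the same strings but leaves only one way to consume each character, so a failing match should be rejected without the long search. The names `BLOCK`, `ITEM`, `HOSTS` and `is_valid` are illustrative.

```python
import re

# Hypothetical rewrite, not taken from the thread.
BLOCK = r'\{\d+(?:,\d+)*\}'            # a {} block such as {10,20}
ITEM = r'(?:[a-z.-]|' + BLOCK + r')+'  # single characters or {} blocks
HOSTS = re.compile(r'^' + ITEM + r'(?:,' + ITEM + r')*$')


def is_valid(text):
    """Return True if text is a comma-separated list of host patterns."""
    return HOSTS.match(text) is not None


good = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com'
print(is_valid(good))        # the well-formed input from the question
print(is_valid(good + ' '))  # the same input with a trailing space
```

The sketch avoids features newer than Python 2.7, so it should also run on the versions mentioned in the question.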

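【Comments】:

All good information, but I think the performance hit he is seeing is thoroughly non-linear. There are several similar edge cases in re, which is why they are working on a replacement module.
Thanks for explaining in detail why it takes so long. But I still wonder why it runs such large loops and gets into trouble.
@meharo - The "loops" give it a search space in which to look for a match. For your regex and input text, I believe there is no match (I have let that regex run against that input text for more than 45 minutes, and it still has not concluded that there is no match). The regex engine cannot know there is no match until it has tried every possibility your pattern allows, except those it can somehow prune as impossible for the given input.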

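【Solution 2】:

There are some definite performance bugs in re. Adding the "$" at the end of the pattern aggravates this particular case. If you remove it, the match completes quickly, and you can then check manually whether it reached the end of the line/string.

If you have time, you may want to get the latest Python beta, confirm that the bug is still present, and report it. I reported one a while back, and they have improved it.

【Comments】: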
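A minimal sketch of that suggestion, assuming the simplified pattern from the question: the "$" is dropped, and the end of the match is compared with the length of the string. The helper name `matches_whole` is hypothetical, and the check relies on the assumption that the greedy match this pattern finds first is also the longest one it could find.

```python
import re

# The simplified pattern from the question with the trailing "$" removed.
PREFIX = re.compile(r'^([a-z.-]+|\{[\d,]+\})+(,([a-z.-]+|\{[\d,]+\})+)*')


def matches_whole(text):
    """Match a prefix, then check by hand that it covers the whole string."""
    m = PREFIX.match(text)
    return m is not None and m.end() == len(text)


text = 'fe{10,20}.clustera{1,2}.example.com,fe{1,5}.clusterb{1,8}.example.com'
print(matches_whole(text))        # input without the trailing space
print(matches_whole(text + ' '))  # input with the trailing space
```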