Unexpected end of pattern: Python regex

【Question title】: Unexpected end of Pattern : Python Regex
【Posted】: 2011-07-20 20:17:08
【Question】:

When I run the function below with the following Python regular expression, I get the error "Unexpected end of Pattern".

Regex:

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

Purpose of this regex:

Input:

CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Should match:

CODE876
CODE223
CODE657
CODE697

and replace each occurrence with

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743

Should not match:

code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665

Final output

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Edit and update 1

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

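The error no longer occurs, but the pattern does not match anything I need. Is there a problem with the capturing groups, or with the match itself? When I compile the regex this way, my input is not matched.

Edit and update 2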

f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()

s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)', 
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)
print s1

Input

CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345

CODE234

CODE333

Output

<a href="http://productcode/CODE123">CODE123</a> <a href="http://productcode/CODE765">CODE765</a> testing1<a href="http://productcode/CODE123">CODE123</a> example1<a href="http://productcode/CODE345">CODE345</a> http://www.coding.com/<a href="http://productcode/CODE333">CODE333</a> <a href="http://productcode/CODE345">CODE345</a>

<a href="http://productcode/CODE234">CODE234</a>

<a href="http://productcode/CODE333">CODE333</a>

The regex works on the raw input, but not on string input read from a text file.

For more results, see inputs 4 and 5 at http://ideone.com/3w1E3

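【Comments】:

The general disclaimer about not processing HTML/XHTML/XML with regular expressions applies.
What should happen with code876? With CODE8765?
@thinkcool: Edit your question to include the code876 and CODE8765 examples. Note: your pattern makes no attempt to limit the number of digits after CODE. Also, as already suggested, use re.VERBOSE so that you can see more clearly what it is doing.
@thinkcool: CODE69743 is in the desired output but not in the input
@thinkcool: What about an input of CODE123XYZ?

Tags: python regex pattern-matching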

【Answer 1】:

Your main problem is the (?-i) thingy, which is wishful thinking as far as Python 2.7 and 3.2 are concerned. See below for details.

import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
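
A minimal sketch of how to confirm that this is a compile-time problem, independent of any input text. The helper name compile_error and the shortened patterns are illustrative assumptions, not part of the original code; the exact error message depends on the Python version.

```python
import re

def compile_error(pattern):
    """Return the error message from re.compile, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Shortened stand-ins for the question's pattern, with (?i) moved to the start.
with_off_switch = r'(?i)((?:(?!http://).)*?)(?-i)(CODE[0-9]{3})'
without_off_switch = r'(?i)((?:(?!http://).)*?)(CODE[0-9]{3})'

print(compile_error(with_off_switch))     # an error message (wording varies by version)
print(compile_error(without_off_switch))  # None
```

Because the failure happens inside re.compile, no amount of changing the input can make it go away; only the pattern itself can be fixed.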

It looks like the suggestions fell on deaf ears... here is the pattern in re.VERBOSE format:

pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
    '''
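
To see what the hints in the comments point at, it may help to compile the verbose pattern and look at what each group captures. The sketch below makes two assumptions: the inline (?i) is replaced by passing re.IGNORECASE to re.compile (recent Python 3 releases reject a global inline flag that is not at the very start of the pattern), and groups_for is a hypothetical helper name.

```python
import re

# pattern4 from above, minus the inline (?i); the flag is passed to
# re.compile instead so that the pattern compiles on current Python 3.
pattern4 = r'''
    ^
    (                       # group 1: the text in front of the code
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            .
        )*?
    )
    (CODE[0-9]{3})          # group 2: the product code itself
    (?!</a>)
    '''
prog = re.compile(pattern4, re.VERBOSE | re.IGNORECASE)

def groups_for(text):
    """Return (group 1, group 2) for a match at the start of text, or None."""
    m = prog.match(text)
    return m.groups() if m else None

for sample in ['matchjustCODE657', 'code876', 'testing1CODE888']:
    print(sample, '->', groups_for(sample))
```

Group 1 holds the prefix and group 2 holds the code, so a replacement built only from \g<1> never contains the code. The case-insensitive flag also lets code876 match, which is the CODE/code issue mentioned above.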

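【Comments】:

@thinkcool: This answer correctly answers the question you asked. Doesn't it deserve an upvote at least?
I used the regex you posted, together with this code: prog = re.compile(pattern4,re.VERBOSE) result = prog.match(mytext) print result. Am I doing it the right way? My input has no matches.
@thinkcool: The regex I posted, pattern4, is functionally identical to yours, and the comments hint at why it doesn't work. You are invited to try to work out the problem yourself.
I'll try and get back to you.. thanks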

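【Answer 2】:

OK, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python, they seem to always modify the whole regex, the same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.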
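
The behaviour described here is that of the Python versions discussed in this thread (2.7 and 3.2). To my knowledge, Python 3.6 and later do accept the scoped form (?i:...), while a bare (?-i) is still rejected. A small sketch under that assumption, with a made-up prefix "match" and a hypothetical helper code_of:

```python
import re

# Scoped flag group: only "match" is case-insensitive; CODE stays case-sensitive.
# Assumes Python 3.6 or later.
scoped = re.compile(r'(?i:match)(CODE[0-9]{3})')

def code_of(text):
    """Return the captured product code if the whole text matches, else None."""
    m = scoped.fullmatch(text)
    return m.group(1) if m else None

print(code_of('MatchCODE657'))
print(code_of('matchcode657'))
```

On older interpreters the same effect needs a workaround, such as spelling out character classes like [Mm][Aa][Tt][Cc][Hh] for the case-insensitive part.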

By the way, I don't see any reason to use three separate lookaheads, as you do here:

(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?

Combine them into one:

(?:(?!http://|testing[0-9]|example[0-9]).)*?
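
A quick way to check that the two forms behave the same is to run both against the sample strings from the question. This is only a sketch: the surrounding ^ anchor, the capturing groups, and the helper name groups are assumptions added to make the fragments testable.

```python
import re

separate = re.compile(
    r'^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})')
combined = re.compile(
    r'^((?:(?!http://|testing[0-9]|example[0-9]).)*?)(CODE[0-9]{3})')

samples = ['CODE876', 'matchjustCODE657', 'code876',
           'testing1CODE888', 'example2CODE098', 'http://replaced/CODE665']

def groups(rx, text):
    """Return the captured groups for a match at the start of text, or None."""
    m = rx.match(text)
    return m.groups() if m else None

for text in samples:
    print(text, groups(separate, text), groups(combined, text))
```

Three negative lookaheads in a row and one negative lookahead over an alternation both say "none of these may start here", so the results agree on every sample.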

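EDIT: We seem to have moved from the question of why the regex throws an exception to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.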

s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)', 
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)

See it in action on ideone.com

Is this what you're after?
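
When the input is one string holding several lines, for instance the result of f.read(), the ^ anchor only matches at the very start of the string unless re.MULTILINE is set. The sketch below applies the regex above with that flag added; the flag and the helper name link_lines are assumptions, and the input is assumed to hold one candidate per line.

```python
import re

rx = re.compile(
    r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
    re.MULTILINE,  # let ^ match at the start of every line
)
replacement = r'\g<1><a href="http://productcode/\g<2>">\g<2></a>'

def link_lines(text):
    """Wrap the first qualifying code on each line in an anchor element."""
    return rx.sub(replacement, text)

text = '\n'.join(['CODE876', 'matchjustCODE657', 'code876',
                  'testing1CODE888', 'example2CODE098',
                  'http://replaced/CODE665'])
print(link_lines(text))
```

Note that the lookahead is only checked at the start of each line here, and that a code followed by further digits is still linked on its first three digits.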

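EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know that full URLs (the ones that start with http://) only occur inside already-existing anchor elements. That means we can split the regex into two alternatives: one that matches complete <a>...</a> elements, and one that matches our target strings.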

(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))

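The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function finds it in group(1) and returns it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.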

Here's another demo. (I know that's horrible code, but I'm too tired right now to learn the more Pythonic way.)
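
The function-based replacement described above could look roughly like this. It is a sketch, not the code from the linked demo: the names link_code and link_codes and the sample text are assumptions, while the pattern is the one given in this answer.

```python
import re

# Pattern copied from the answer: group 1 is a whole existing anchor,
# groups 2 and 3 are the word prefix and the product code.
rx = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))'
)

def link_code(match):
    """Replacement callback: keep existing anchors, link bare codes."""
    if match.group(1) is not None:
        return match.group(1)          # existing <a>...</a>, returned unchanged
    prefix, code = match.group(2), match.group(3)
    return '{0}<a href="http://productcode/{1}">{1}</a>'.format(prefix, code)

def link_codes(text):
    return rx.sub(link_code, text)

sample = ('CODE123 testing1CODE123 example1CODE345 '
          '<a href="http://www.coding.com/CODE333">CODE333</a> matchjustCODE657')
print(link_codes(sample))
```

Because re.sub scans left to right, an existing anchor is consumed as a whole by the first alternative, so the codes inside it are never considered by the second one.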

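【Comments】:

When I tried this regex, it matched every string containing CODE[0-9]{3} and replaced it with http://productcode/CODE[0-9]{3}. There is a problem with my capturing groups.
I updated my answer to include a complete solution. Let me know what you think.
Thanks a lot.. this worked for me. However, since I am parsing the contents of a text file, I read the contents of that file into a string and apply this regex to it. It replaces all occurrences of CODE[0-9]{3} with productcode/CODE[0-9]{3}. It does not handle the special cases.
This does not handle the OP's CODE69743 case.
@thinkcool: You say it worked for you, but it doesn't handle the "special cases"? What special cases?? It works for all the test cases except CODE followed by more than 3 digits.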

【Answer 3】:

The only problem I see is that your replacement uses the wrong capturing group.

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)  
                       ^                                                        ^                                                        ^
                    first capturing group                                  second one                                         using the first group

Here I have made the first group non-capturing as well:

^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)

See it here on Regexr
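
The effect of the group numbering can be checked directly. The sketch below drops the (?i) and (?-i) parts so that it compiles in Python (an assumption, following the other answers) and compares the capturing and non-capturing variants of the prefix with the same replacement string.

```python
import re

prefix = r'(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?'
code = r'(CODE[0-9]{3})(?!</a>)'
repl = r'<a href="http://productcode/\g<1>">\g<1></a>'

capturing = re.compile('^(' + prefix + ')' + code)        # prefix is group 1
non_capturing = re.compile('^(?:' + prefix + ')' + code)  # code is group 1

print(capturing.sub(repl, 'matchjustCODE657'))
print(non_capturing.sub(repl, 'matchjustCODE657'))
```

With the capturing prefix, \g<1> inserts "matchjust" instead of the code. With the non-capturing prefix the link is correct, but the prefix is part of the match and is dropped from the output; to keep it, capture it and emit it in front of the link.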

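【Comments】:

When I use your modified regex, I get the same error. Does Python use a different regex engine from RegExr's?
I removed the (?-i) as John suggested and I no longer get the error, but I cannot get a match with this regex.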

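【Answer 4】:

For complicated regexes, use the re.X flag to document what you are doing and to make sure the parentheses match up correctly (i.e. use indentation to indicate the current nesting level).

【Comments】: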