Replacing named capturing groups with re.sub: answers

【Question title】: Replacing named capturing groups with re.sub
【Posted】: 2015-02-22 01:57:53
【Question】:

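I want to replace the text matched by a regular expression pattern in a string, which re.sub() can do. If I pass a function as the repl argument in the call, it works as desired, as shown below: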

from __future__ import print_function
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'

my_str = "Here's some <first>sample stuff</first> in the " \
            "<second>middle</second> of some other text."

def replace(m):
    return ''.join(map(lambda v: v if v else '',
                        map(m.group, ('text', 'content'))))

cleaned = re.sub(pattern, replace, my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

cleaned: "Here's some sample stuff in the middle of some other text."

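From the documentation, however, it looks like I should be able to get the same result by passing a replacement string that references the named groups. My attempt at this did not work, because sometimes a group is unmatched and the value returned for it is None (rather than an empty string '').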

cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

Traceback (most recent call last):
  File "test_resub.py", line 21, in <module>
    cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
  File "C:\Python\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Python\lib\re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python\lib\sre_parse.py", line 802, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

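What am I doing wrong or not understanding?

【Comments】:

The content of the last match is None...
@KennyTM: I know that some of the matched groups will be None, which is why I used lambda v: v if v else '' in the replace() function. Is something similar needed in the replacement string and, if so, how is it done?

Tags: python regex substitution

【Solution 1】: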

import re

def repl(matchobj):
    # group(3) is the 'content' group; it is None when no tag was matched.
    if matchobj.group(3):
        return matchobj.group(1) + matchobj.group(3)
    else:
        return matchobj.group(1)

my_str = "Here's some <first>sample stuff</first> in the " \
        "<second>middle</second> of some other text."

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
print(re.sub(pattern, repl, my_str))

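You can pass a callable as the repl argument of re.sub.

Edit: cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str) does not work because, when the last part of the string (of some other text.) is matched, \g<text> is defined but \g<content> is not, since there is no content. You are still asking re.sub to substitute it, so it raises an error. If you use the string "Here's some <first>sample stuff</first> in the <second>middle</second>" instead, print(re.sub(pattern, r"\g<text>\g<content>", my_str)) will work, because \g<content> is always defined there.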
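
The explanation above can be checked with a small sketch that lists what the text and content groups captured for each match. The name captured is only illustrative. Zero-length matches are skipped, because newer Python versions may also report an empty match at the very end of the string.

```python
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Collect the 'text' and 'content' groups of every non-empty match.
captured = [(m.group('text'), m.group('content'))
            for m in re.finditer(pattern, my_str)
            if m.group(0)]

for text, content in captured:
    print('text={!r} content={!r}'.format(text, content))
```

The last tuple has None for content, which is the group the replacement string cannot expand.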

【Comments】:

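I know you can pass a function to re.sub(); that is what the first piece of code in my question does. I want to know how to do it by passing a replacement string that contains references to named (or numbered) groups.
So you are saying there is no way to handle a group in the pattern that did not match (even though the pattern as a whole matched, because it allows the group to occur zero or more times) other than passing re.sub() a function? I was hoping that was not true and that some conditional form of referencing named capturing groups existed.
@martineau That is why we can use a function in this situation.
That sounds like a major shortcoming to me; letting the user supply a function is only a workaround. I am tempted to accept your edited answer, but will wait a while first to see whether any other answers turn up.
No. There is a PyPI module named regex that gives such groups the value '' instead of None, just as Perl and PCRE do. Unfortunately Python's re module has no flag for that... I guess I will have to use the function version of the argument.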
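
A note for readers on newer interpreters: as far as I can tell from the re documentation, since Python 3.5 unmatched groups in a replacement template are replaced with an empty string, so the call from the question no longer raises. A minimal sketch, assuming Python 3.5 or later (this postdates the thread and is not something the participants had available):

```python
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Assumption: Python 3.5+. On older versions this call raises
# "unmatched group" for the final match, where 'content' is None.
cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))
```

On older versions, the function form of repl shown in the question remains the portable choice.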

【Solution 2】:

If I understand correctly, you want to remove everything between < and >, brackets included:

>>> import re
>>> my_str = "Here's some <first>sample stuff</first> in the <second>middle</second> of some other text."
>>> print(re.sub(r'<.*?>', '', my_str))
Here's some sample stuff in the middle of some other text.

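A brief explanation of what r'<.*?>' does:

< finds the first <

. then accepts any character

* accepts that character any number of times

? limits the match to the shortest possible one; without it, the match would run on to the last > instead of the first available one

> finds the closing >

Everything between those two points, brackets included, is then replaced with nothing.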
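
The effect of the ? in that pattern can be seen by running both variants side by side. This is a small illustrative sketch; the names lazy and greedy are not from the answer.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Non-greedy: each match stops at the nearest '>', so only the tags go.
lazy = re.sub(r'<.*?>', '', my_str)

# Greedy: a single match runs from the first '<' to the last '>',
# so everything between the outermost brackets is removed as well.
greedy = re.sub(r'<.*>', '', my_str)

print(repr(lazy))
print(repr(greedy))
```

The greedy version loses "sample stuff in the middle" along with the tags.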

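【Comments】:

I want to remove all the <xxx>s and the corresponding </xxx>s as though they were never there, but keep everything else, including whatever is between them.
My re.sub suggestion should do exactly that... If it misses something, please post the full string you are working with so we can help further.
The whole string is in my question, in the my_str variable. I named the capturing groups in my regular expression and want to reference them in the replacement string passed to re.sub(). Your suggestion does not do that, so it does not seem helpful: it does not answer the question I asked, which is how to reference them in the replacement string without getting an error.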
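
One possible way to reference a named group in a replacement string without the error is to restructure the pattern so that it matches only the tagged spans; every group the template refers to then takes part in every match. This is an editorial sketch rather than something proposed in the thread, and tag_pattern is a hypothetical name. It also uses a non-greedy content group, so it behaves differently from the original pattern when same-named tags are nested.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Matches only <tag>...</tag> spans, so 'content' always participates
# (it may be empty, but it is never unmatched).
tag_pattern = r'<(?P<tag>\w+)>(?P<content>.*?)</(?P=tag)>'

# Text outside the tags is left untouched by re.sub.
cleaned = re.sub(tag_pattern, r'\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))
```

Because the surrounding text is never part of a match, no text group is needed in the template.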