How to find and replace the nth occurrence of a word in a sentence using a Python regular expression? Answers

【Question title】: How to find and replace nth occurrence of word in a sentence using python regular expression?
【Posted】: 2015-02-19 18:19:25
【Question】:

Using only Python regular expressions, how can I find and replace the nth occurrence of a word in a sentence? For example:

import re

str = 'cat goose  mouse horse pig cat cow'
new_str = re.sub(r'cat', r'Bull', str)     # replaces every 'cat'
new_str = re.sub(r'cat', r'Bull', str, 1)  # replaces only the first 'cat'
new_str = re.sub(r'cat', r'Bull', str, 2)  # replaces the first two 'cat's

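The sentence above contains the word 'cat' twice. I want to change the second occurrence of 'cat' to 'Bull' and leave the first 'cat' untouched. My final sentence should look like: "cat goose mouse horse pig Bull cow". In the code above I tried three different calls and none of them gave me what I wanted.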

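【Comments on the question】:

I think it would be better to split the string, count the occurrences of cat, and return a modified list in which the nth one is replaced. It may be a bit slower, but that probably doesn't matter, and it will certainly be more readable than a hairy regex.

Tags: python regex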

【Solution 1】:

You can match two occurrences of 'cat', keep everything before the second occurrence (\1), and append 'Bull':

new_str = re.sub(r'(cat.*?)cat', r'\1Bull', str, 1)

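As Avinash Raj pointed out in the comments, we perform only one substitution, so that the fourth, sixth, etc. occurrences of 'cat' are not replaced as well (when there are at least four occurrences).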
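To see why the count matters, here is a small sketch using the input 'cat cat cat goose mouse cat' (the variable names are only for illustration). Without a count, the pattern consumes occurrences in pairs and replaces every second one; with count=1 only the second occurrence changes.

```python
import re

s = 'cat cat cat goose mouse cat'
pattern = r'(cat.*?)cat'

# No count: occurrences are consumed in pairs, so the 2nd and 4th are replaced.
every_second = re.sub(pattern, r'\1Bull', s)

# count=1: only the first pair is processed, so only the 2nd occurrence changes.
only_second = re.sub(pattern, r'\1Bull', s, count=1)

print(every_second)  # cat Bull cat goose mouse Bull
print(only_second)   # cat Bull cat goose mouse cat
```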

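To replace the n-th occurrence rather than the second, use:

n = 2
# The outer group captures everything before the n-th 'cat', so \1 restores all of it.
new_str = re.sub(r'((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', str, 1)

By the way, you shouldn't use str as a variable name, because it shadows Python's built-in str type.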
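The n-th occurrence idea can be wrapped in a small helper. This is only a sketch: the name replace_nth and its parameters are made up for illustration, the word is treated as a literal (via re.escape), and re.DOTALL is an assumption so that the filler may span newlines. The whole prefix is captured in one outer group, because a repeated capturing group only remembers its last repetition.

```python
import re

def replace_nth(text, word, replacement, n):
    """Replace the n-th (1-based) occurrence of the literal `word` in `text`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    escaped = re.escape(word)
    # Group 1 captures everything before the n-th occurrence; the inner
    # non-capturing group repeats "word + shortest filler" n-1 times.
    pattern = r'((?:%s.*?){%d})%s' % (escaped, n - 1, escaped)
    # A function replacement avoids backslash processing in `replacement`.
    return re.sub(pattern, lambda m: m.group(1) + replacement,
                  text, count=1, flags=re.DOTALL)

print(replace_nth('cat goose  mouse horse pig cat cow', 'cat', 'Bull', 2))
# cat goose  mouse horse pig Bull cow
```

If the text contains fewer than n occurrences, the pattern does not match and the text is returned unchanged.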

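【Discussion】:

Note that the OP wants to change only the second one. If the input is cat cat cat goose mouse cat, yours will fail.
Then why do you use str as a variable name?
@Avinash Raj: I have used (and not modified) the variable used in the question.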

【Solution 2】:

Use a negative lookahead, as shown below.

>>> s = "cat goose  mouse horse pig cat cow"
>>> re.sub(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', r'\1Bull', s)
'cat goose  mouse horse pig Bull cow'


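^ asserts that we are at the start of the string.
(?:(?!cat).)* matches any character, zero or more times, as long as it does not begin the substring cat.
cat matches the first cat substring.
(?:(?!cat).)* again matches any character, zero or more times, without running into cat.
Now the whole pattern so far is enclosed in a capturing group, ((?:(?!cat).)*cat(?:(?!cat).)*), so that the captured characters can be referenced later.
cat now matches the second cat substring.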

Or

>>> s = "cat goose  mouse horse pig cat cow"
>>> re.sub(r'^(.*?(cat.*?){1})cat', r'\1Bull', s)
'cat goose  mouse horse pig Bull cow'

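Change the number inside {} to replace the first, second, or nth occurrence of the string cat. The number is n - 1, so {1} targets the second occurrence.

To replace the third occurrence of the string cat, put 2 inside the braces: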

>>> re.sub(r'^(.*?(cat.*?){2})cat', r'\1Bull', "cat goose  mouse horse pig cat foo cat cow")
'cat goose  mouse horse pig cat foo Bull cow'
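The negative-lookahead pattern can also be built for an arbitrary n. The following is a sketch under stated assumptions: replace_nth_lookahead is a hypothetical name, the word is escaped and treated as a literal, and, as in the patterns above, . does not match newlines. Each "no cat here" stretch is followed by exactly one cat, repeated n - 1 times inside the captured prefix, so n = 1 also works when text precedes the first occurrence.

```python
import re

def replace_nth_lookahead(text, word, replacement, n):
    """Replace the n-th (1-based) occurrence of `word` using a negative lookahead."""
    if n < 1:
        raise ValueError("n must be >= 1")
    w = re.escape(word)
    # Any run of characters that never starts an occurrence of the word.
    not_word = r'(?:(?!%s).)*' % w
    # Prefix = (n-1) times [non-word run + word], then one more non-word run.
    pattern = r'^((?:%s%s){%d}%s)%s' % (not_word, w, n - 1, not_word, w)
    return re.sub(pattern, lambda m: m.group(1) + replacement, text, count=1)

print(replace_nth_lookahead('cat goose  mouse horse pig cat foo cat cow', 'cat', 'Bull', 3))
# cat goose  mouse horse pig cat foo Bull cow
```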


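【Discussion】:

Hi, what advantage does this have over using r'(cat.*?)cat'?
Then how does it deserve a downvote? It's not a wrong answer, though.
@Pierre: I wrote my comment above. Since both of you use ., there should be no difference as far as I can tell.
@AvinashRaj: People may downvote because this is an overly complicated answer. (The downvote is not mine, by the way.)
This solution does not work if n = 1 and the first 'cat' is preceded by other characters: regex101.com/r/wP7pR2/32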

【Solution 3】:

Here is a way to do it without a regex:

def replaceNth(s, source, target, n):
    inds = [i for i in range(len(s) - len(source)+1) if s[i:i+len(source)]==source]
    if len(inds) < n:
        return  # or maybe raise an error
    s = list(s)  # can't assign to string slices. So, let's listify
    s[inds[n-1]:inds[n-1]+len(source)] = target  # do n-1 because we start from the first occurrence of the string, not the 0-th
    return ''.join(s)

Usage:

In [278]: s
Out[278]: 'cat goose  mouse horse pig cat cow'

In [279]: replaceNth(s, 'cat', 'Bull', 2)
Out[279]: 'cat goose  mouse horse pig Bull cow'

In [280]: print(replaceNth(s, 'cat', 'Bull', 3))
None
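A variant of the same non-regex idea can use str.find to walk from one occurrence to the next instead of scanning every index. This is a sketch, not the answer's code: the name replace_nth_find is made up, and returning the original string (rather than None) when there are fewer than n occurrences is an assumption about the desired behavior. Like the function above, it counts overlapping occurrences.

```python
def replace_nth_find(s, source, target, n):
    """Replace the n-th (1-based) occurrence of `source` in `s` without regex."""
    if n < 1 or not source:
        raise ValueError("n must be >= 1 and source must be non-empty")
    index = -1
    for _ in range(n):
        # Continue searching just past the start of the previous hit.
        index = s.find(source, index + 1)
        if index == -1:
            return s  # fewer than n occurrences: leave the string unchanged
    return s[:index] + target + s[index + len(source):]

s = 'cat goose  mouse horse pig cat cow'
print(replace_nth_find(s, 'cat', 'Bull', 2))  # cat goose  mouse horse pig Bull cow
print(replace_nth_find(s, 'cat', 'Bull', 3))  # cat goose  mouse horse pig cat cow
```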

【Discussion】:

This is the only answer that worked for my case.

【Solution 4】:

I would define a function that works with any regex:

import re

def replace_ith_instance(string, pattern, new_str, i = None, pattern_flags = 0):
    # If i is None, replace the last occurrence
    matches = list(re.finditer(pattern, string, flags=pattern_flags))
    if i is None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())

    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])

A working example:

str = 'cat goose  mouse horse pig cat cow'
ns = replace_ith_instance(str, 'cat', 'Bull', 2)
print(ns)

Output:

cat goose  mouse horse pig Bull cow

Another example:

str2 = 'abc abc def abc abc'
ns = replace_ith_instance(str2, r'abc\s*abc', '666')  # raw string avoids an invalid-escape warning
print(ns)

Output:

abc abc def 666

【Discussion】:

【Solution 5】:

Create a repl function to pass to re.sub(). Except... the trick is to make it a class, so that you can keep track of the call count.

class ReplWrapper(object):
    def __init__(self, replacement, occurrence):
        self.count = 0
        self.replacement = replacement
        self.occurrence = occurrence
    def repl(self, match):
        self.count += 1
        if self.occurrence == 0 or self.occurrence == self.count:
            return match.expand(self.replacement)
        else:
            return match.group(0)  # leave every other occurrence unchanged

Then use it like this:

myrepl = ReplWrapper(r'Bull', 0) # replaces all instances in a string
new_str = re.sub(r'cat', myrepl.repl, str)

myrepl = ReplWrapper(r'Bull', 1) # replaces 1st instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)

myrepl = ReplWrapper(r'Bull', 2) # replaces 2nd instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)

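I'm sure there is a cleverer way that avoids using a class, but this seemed straightforward enough to explain. Also, be sure to return match.expand(), because returning just the replacement value is not technically correct if someone decides to use \1-style templates.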
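One possible class-free variant is a closure that keeps the call count in an itertools.count iterator. This is a sketch only: make_nth_repl is a hypothetical name, and it keeps the same convention as the class (occurrence 0 means "replace all").

```python
import re
from itertools import count

def make_nth_repl(replacement, occurrence):
    """Build a repl callback for re.sub that expands only the given occurrence."""
    counter = count(1)  # yields 1, 2, 3, ... once per call that consumes it

    def repl(match):
        # occurrence == 0 short-circuits, so every match is replaced.
        if occurrence == 0 or next(counter) == occurrence:
            return match.expand(replacement)
        return match.group(0)  # leave other matches untouched

    return repl

s = 'cat goose  mouse horse pig cat cow'
print(re.sub(r'cat', make_nth_repl(r'Bull', 2), s))  # cat goose  mouse horse pig Bull cow
print(re.sub(r'cat', make_nth_repl(r'Bull', 0), s))  # Bull goose  mouse horse pig Bull cow
```

As with the class, a fresh callback has to be created for each re.sub() call, because the counter is not reset.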

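【Discussion】:

【Solution 6】:

I used a simple function that lists all occurrences, picks the position of the n-th one, and uses it to split the original string into two substrings. It then replaces the first occurrence in the second substring and joins the substrings back into a new string: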

import re

def replacenth(string, sub, wanted, n):
    where = [m.start() for m in re.finditer(sub, string)][n-1]
    before = string[:where]
    after = string[where:]
    newString = before + after.replace(sub, wanted, 1)
    print(newString)

For these variables:

string = 'ababababababababab'
sub = 'ab'
wanted = 'CD'
n = 5

Output:

ababababCDabababab

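Notes:

The list comprehension behind where builds a list of all match positions, from which the n-th one is selected. Since list indices start at 0 rather than 1, the code indexes with n-1, so that n is the actual ordinal of the substring. My example finds the 5th occurrence. If you indexed with n directly, you would need n = 4 to reach the 5th position. Which convention you use usually depends on the function that produces n.

This should be the simplest way, but it is not the regex-only solution you originally asked for.

Sources and some additional links:

Building where: How to find all occurrences of a substring?

String splitting: https://www.daniweb.com/programming/software-development/threads/452362/replace-nth-occurrence-of-any-sub-string-in-a-string

Similar question: Find the nth occurrence of substring in a string

【Discussion】:

Thanks! I think you need to reassign it, like this: after = after.replace(sub, wanted, 1). I don't believe it is changed in place. (Also, the colon after the function definition.)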

【Solution 7】:

How to replace the nth needle with word:

s.replace(needle,'$$$',n-1).replace(needle,word,1).replace('$$$',needle)
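The one-liner relies on a placeholder that must not already appear in the string. A hedged sketch that makes this assumption explicit is shown below; replace_nth_plain and the placeholder parameter are illustrative names, and the NUL character default is an arbitrary choice.

```python
def replace_nth_plain(s, needle, word, n, placeholder='\x00'):
    """Replace the n-th (1-based) needle using only str.replace."""
    # The trick breaks if the placeholder already occurs in any of the inputs.
    if placeholder in s or placeholder in word or placeholder in needle:
        raise ValueError("placeholder must not occur in the inputs")
    hidden = s.replace(needle, placeholder, n - 1)   # hide the first n-1 needles
    replaced = hidden.replace(needle, word, 1)       # replace the next one
    return replaced.replace(placeholder, needle)     # restore the hidden needles

print(replace_nth_plain('cat goose  mouse horse pig cat cow', 'cat', 'Bull', 2))
# cat goose  mouse horse pig Bull cow
```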

【Discussion】:

This question (from 2014) specifically asks for Python regular expressions, and it already has an accepted answer. This does not improve on that answer.

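【Solution 8】:

I solved this by generating a "grouped" version of the desired catch pattern relative to the whole string, and then applying the substitution directly to that instance.

The parent function is regex_n_sub, which takes the same inputs as the re.sub() method.

The catch pattern is passed to get_nsubcatch_catch_pattern() together with the instance number. Inside, a list comprehension generates multiple copies of the pattern '.*?' (match any character, 0 or more repetitions, non-greedy). This pattern represents the space between the occurrences of catch_pattern leading up to the nth one.

Next, the input catch_pattern is placed between each of the n "space patterns", and the result is wrapped in parentheses to form the first group.

The second group is just catch_pattern wrapped in parentheses, so when the two groups are combined, a pattern for "all text up to the nth occurrence of the catch pattern" is created. This new_catch_pattern has two groups built in, so the second group, which contains the nth occurrence of catch_pattern, can be replaced.

The replace pattern is passed to get_nsubcatch_replace_pattern() and combined with the prefix r'\g<1>' to form the pattern \g<1> + replace_pattern. The \g<1> part re-inserts group 1 from the catch pattern unchanged, and the replacement text takes the place of group 2.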
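To make the description concrete, the following sketch prints the pattern that this construction produces for n = 4 and applies it once. It uses the standard re module and plain string concatenation, which is an illustrative simplification rather than the answer's exact code.

```python
import re

catch_pattern = 'I'
n = 4

# n lazy "space patterns" joined by the catch pattern, wrapped as group 1.
first_group = '(' + catch_pattern.join(['.*?'] * n) + ')'
# The n-th occurrence itself, wrapped as group 2.
second_group = '(' + catch_pattern + ')'
new_catch_pattern = first_group + second_group
print(new_catch_pattern)  # (.*?I.*?I.*?I.*?)(I)

text = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
# \g<1> puts back everything before the 4th 'I'; 'me' replaces group 2.
result = re.sub(new_catch_pattern, r'\g<1>' + 'me', text, count=1)
print(result)
```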

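The code below is verbose only to make the flow easier to follow; it can be shortened as needed.

--

The example below should run standalone, and corrects the fourth instance of "I" to "me":

"When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."

with

"When I go to the park and I am alone I think the ducks laugh at me but I'm not sure."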

import re  # the standard library module is sufficient for this example

def regex_n_sub(catch_pattern, replace_pattern, input_string, n, flags=0):
    new_catch_pattern, new_replace_pattern = generate_n_sub_patterns(catch_pattern, replace_pattern, n)
    return_string = re.sub(new_catch_pattern, new_replace_pattern, input_string, 1, flags)
    return return_string

def generate_n_sub_patterns(catch_pattern, replace_pattern, n):
    new_catch_pattern = get_nsubcatch_catch_pattern(catch_pattern, n)
    new_replace_pattern = get_nsubcatch_replace_pattern(replace_pattern, n)
    return new_catch_pattern, new_replace_pattern

def get_nsubcatch_catch_pattern(catch_pattern, n):
    space_string = '.*?'
    space_list = [space_string for i in range(n)]
    first_group = catch_pattern.join(space_list)
    first_group = first_group.join('()')
    second_group = catch_pattern.join('()')
    new_catch_pattern = first_group + second_group
    return new_catch_pattern

def get_nsubcatch_replace_pattern(replace_pattern, n):
    new_replace_pattern = r'\g<1>' + replace_pattern
    return new_replace_pattern


### use test ###
catch_pattern = 'I'
replace_pattern = 'me'
test_string = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."

regex_n_sub(catch_pattern, replace_pattern, test_string, 4)

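This code can be copied directly into a workflow; the regex_n_sub() call returns the string with the replacement applied.

Let me know if the implementation fails!

Thanks!

【Discussion】:

【Solution 9】:

Simply because none of the current answers fit what I needed, here is one based on aleskva's answer: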

import re

def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n) :
        return string
    m = matches[ n-1 if n > 0 else len(matches) + n] 
    return string[0:m.start()] + replacement + string[m.end():]

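It accepts negative match numbers (n = -1 selects the last match) and any regex pattern, and it is efficient. If there are fewer than n matches, the original string is returned.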
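A short usage sketch for the positive and negative indices follows. The function is repeated so that the snippet runs on its own, and the sample sentence is the one from the question.

```python
import re

def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n):
        return string
    # Positive n counts from the start (1-based), negative n from the end.
    m = matches[n - 1 if n > 0 else len(matches) + n]
    return string[0:m.start()] + replacement + string[m.end():]

s = 'cat goose  mouse horse pig cat cow'
print(replacenth(s, r'cat', 'Bull', 2))   # cat goose  mouse horse pig Bull cow
print(replacenth(s, r'cat', 'Bull', -2))  # Bull goose  mouse horse pig cat cow
print(replacenth(s, r'cat', 'Bull', 3))   # cat goose  mouse horse pig cat cow
```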

【Discussion】: