Replace previous occurrence of string

【Question title】: Replace previous occurrence of string
【Posted】: 2020-11-30 08:20:25
【Question】:

I want to remove the duplicated word inside each pair of parentheses and replace it with "S" followed by the word.

For example:

(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)

Here is the string, s:

s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""

Expected result:

out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own)))))) 
       (S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"

I tried:

from collections import Counter

lst = s.lstrip("(").rstrip(")").replace("(", "").replace(")", "").split()
d = Counter(lst)
mapper = {((k + " ") * v).strip():"S" + " " + k for k, v in d.items()}
for k, v in mapper.items():
    out = s.replace(k, v)

But the result is not quite right:

out = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (it it) (S (S signed) (S (a a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (for for) (S (S (a a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) (S (S does) (S (S n't) (S own)))))) 
       (S (for for) (S (S (S 11.50) (S (a a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"
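
A likely reason for the pairs that were left unchanged (Skipper, 's, Inc., it, a, for): each of those words appears in more than one pair, and Counter counts occurrences over the whole string. The minimal sketch below uses a hypothetical shortened input, `sample`, to show how the search key for such a word ends up too long to match anything.

```python
from collections import Counter

# Hypothetical shortened input: "a" is duplicated in two different pairs.
sample = "(S (a a) (S (b b) (a a)))"

tokens = sample.replace("(", "").replace(")", "").split()
counts = Counter(tokens)  # counts over the WHOLE string, not per pair

# The attempt builds one search key per word by repeating it `count` times.
keys = {word: " ".join([word] * n) for word, n in counts.items()}

print(keys["b"])  # "b b": occurs in one pair, so the key matches
print(keys["a"])  # "a a a a": occurs in two pairs, so the key never matches
```

Matching each pair locally, rather than counting words globally, avoids this.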

【Comments】:

The parentheses contain only duplicated words; is that always the case?
Yes, always.
Sorry, I edited my question.

Tags: python regex string replace

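【Solution 1】:

You can use re.sub with a backreference in the regular expression.

To find a duplicated word, use \1 inside the pattern to refer to the text matched by the first capture group, and \g<1> to refer to it in the repl argument. Like this: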

import re

res = re.sub(r"([\w.'%]+)\s+\1", r"S \g<1>", s)
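
The pattern above is not anchored to the parentheses, so on other inputs it could also fire when the second word merely starts with the first, for example turning (a ab) into (S ab). A slightly stricter sketch, assuming every pair has the form "(word word)", is shown here; the helper name `collapse_pairs` is made up for illustration.

```python
import re

# "(" + a word without whitespace or parentheses + whitespace + the same word + ")"
PAIR = re.compile(r"\(([^\s()]+)\s+\1\)")

def collapse_pairs(text: str) -> str:
    """Rewrite every "(word word)" as "(S word)"; leave everything else alone."""
    return PAIR.sub(r"(S \g<1>)", text)

print(collapse_pairs("(S (Skipper Skipper) ('s 's))"))  # (S (S Skipper) (S 's))
print(collapse_pairs("(a ab)"))                         # (a ab), unchanged
```

Because the closing parenthesis is part of the match, the backreference has to cover the whole second word.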

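【Discussion】:

【Solution 2】:

You may want to look at regular expressions here. I created a demo that matches all of the innermost parentheses.

With those matches, you can inspect the content of each one and replace it as required: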

import re

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) \
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) \
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) \
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) \
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) \
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) \
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) \
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

# Finding all inner brackets:
# - (Skipper Skipper)
# - ('s 's)
# - etc.
double_words = re.findall(r"\([^()]*\)", s)


for double_word in double_words:
    words = double_word.lstrip("(").rstrip(")").split()
    # Exactly two words, and both are the same
    if len(words) == 2 and words[0] == words[1]:
        # Replace ('s 's) with (S 's)
        s = s.replace(double_word, f'(S {words[0]})')
        
print(s)

Output:

(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))      (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive)      (S (S merger) (S (S agreement) (S (S for) (S (S (S a)      (S (S National) (S (S Pizza) (S (S Corp.) (S unit)))))      (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %)))      (S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it)      (S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50)      (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))

【Discussion】:

【Solution 3】:

Use re.sub with a replacement function:

import re

def sub(matched):
    # Collapse "(word word)" to "(S word)"; return any other pair unchanged
    if matched.group(1) == matched.group(2):
        return f"(S {matched.group(2)})"
    return matched.group(0)

s = '''(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))'''

result = re.sub(r"\(([.%'\w]+) ([.%'\w]+)\)", sub, s)

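【Discussion】:

【Solution 4】:

This solution walks the list of whitespace-separated tokens, looks for adjacent duplicates, and replaces the first token of each duplicated pair with "(S":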

s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""

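word_list = s.split()

for i in range(len(word_list) - 1):
    word, next_word = word_list[i], word_list[i + 1]
    # A duplicated pair looks like "(word" followed by "word)" plus any extra ")"
    if word.startswith("(") and word[1:] == next_word.rstrip(")"):
        word_list[i] = "(S"

# Note: joining on single spaces drops the original line breaks and indentation
s_new = " ".join(word_list)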
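
Splitting on whitespace and re-joining with single spaces discards the line breaks and indentation of the input. If that layout matters, one possible variation is to split with a capturing group so the whitespace runs are kept. The sketch below makes that assumption; the helper name `collapse_tokens` is made up for illustration.

```python
import re

def collapse_tokens(text: str) -> str:
    """Replace the first token of each "(word word)" pair with "(S", keeping whitespace."""
    # With a capturing group, re.split keeps the separators:
    # even indices hold tokens, odd indices hold the whitespace between them.
    parts = re.split(r"(\s+)", text)
    for i in range(0, len(parts) - 2, 2):
        word, next_word = parts[i], parts[i + 2]
        if len(word) > 1 and word.startswith("(") and word[1:] == next_word.rstrip(")"):
            parts[i] = "(S"
    return "".join(parts)

print(collapse_tokens("(S (a a)\n  (b b))"))
```

The result keeps the newline and the two-space indent of the sample input.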

【Discussion】: