Conditionals on replacement in a string: answers

【Question Title】: Conditionals on replacement in a string
【Posted】: 2014-03-13 01:56:11
【Question】:

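So I may have a string 'Bank of China', or 'Embassy of China', and 'International China'.

I want to replace all country instances except when they are preceded by 'of ' or 'of the '.

Clearly this can be done by iterating through a list of countries, checking whether the name contains a country, and then checking whether 'of ' or 'of the ' appears before the country.

If one of these does appear, we do not remove the country; otherwise we do remove it. The examples would become:

'Bank of China', or 'Embassy of China', and 'International'

However, iterating can be slow, especially when you have a large list of countries and a large list of texts to process.

Is there a faster, more condition-based way of replacing in the string, so that I can still use a simple pattern match with the Python re library?

My function is along these lines: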

def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name =  re.sub(country + '$', '', name).strip()
                return name
    return name

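Edit: I did find some information here. It does describe how to do an "if", but what I really want is: if not 'of ' and if not 'of the ', then replace...

【Question Comments】:

Is there a reason you need to do both the conditional and the substitution in a single regular expression? Or is this an abstract example?

Tags: python regex string string-substitution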

【Solution 1】:

You could compile a few sets of regular expressions and then pass your list of inputs through them. Something like:

import re

countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

# print() with a single argument behaves the same on Python 2 and Python 3
print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))

''' Output:
    the bank of foo
    the bank of the baz
    the nation
'''

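It does not look like anything faster than linear time complexity is possible here. At the very least, you can avoid recompiling the regular expressions a million times and so improve the constant factor.

Edit: I had a few typos, but the basic idea is sound and it works. I have added an example.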
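As a possible refinement of the same idea, the per-country patterns could be folded into two compiled alternations, so each input is scanned at most twice rather than once per country. This is only a sketch: the names `keep` and `strip_trailing` are made up for illustration, `re.escape` is added on the assumption that some country names may contain regex metacharacters, and, unlike the code above, it also removes the whitespace in front of the stripped country.

```python
import re

countries = ['foo', 'bar', 'baz']

# One alternation covering every country; re.escape guards names that
# contain regex metacharacters.
alternation = '|'.join(re.escape(c) for c in countries)

# Matches "of <country>" or "of the <country>" at the end of the string.
keep = re.compile(r'\bof\s+(?:the\s+)?(?:%s)$' % alternation, re.I)
# Matches a trailing country together with any whitespace before it.
strip_trailing = re.compile(r'\s*\b(?:%s)$' % alternation, re.I)

def remove_country(s):
    # Leave the string alone when the country is introduced by "of"/"of the".
    if keep.search(s):
        return s
    return strip_trailing.sub('', s)
```

Whether this is actually faster than the list of separate patterns depends on the size of the country list and on the inputs, so it is worth timing on real data.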

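【Comments】:

Thanks! This looks like a good solution. I could also speed it up by first checking whether a country is present in s at all.
@redrubia Are you sure this solution is OK? I just tested it and it does not seem to return the correct result: remove_country('Embassy of China') gives '' (an empty string). Instead of regex.match(s), it should be regex.find(s). (And regex.replace should be regex.sub.)
Correction: re.search(s).
Yes, this answer is incorrect. I just tried it as well and it seems to produce an empty string.
A few typos, yes. I guess you were looking for a plug-and-play solution... Anyway, it works, and I have posted copy-pasteable code for you.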

【Solution 2】:

I think you could use the approach from "Python: how to determine if a list of words exist in a string" to find any countries mentioned, and then do further processing from there.

Something like

countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia"
    # etc
]

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)

then

get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")

set(['Argentina', 'China', 'Russia'])

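...which obviously needs more post-processing, but it very quickly tells you exactly what you need to look for.

As pointed out in the linked article, you must be wary of words that end in punctuation. This can be handled by splitting on punctuation as well as whitespace, with something like re.split(r"[\s,.!?;:'\"]+", s) (str.split treats its argument as a single multi-character separator, so it cannot split on a set of characters). You may also want to look for adjectival forms, e.g. "Russian", "Chinese", etc.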
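To illustrate the punctuation caveat, here is a minimal variant of the closure above that extracts runs of letters with re.findall instead of calling s.split(). It is a sketch under two stated assumptions: country names are single ASCII words (so "New Zealand" would not be found), and the sample sentence is invented for the example.

```python
import re

def find_words_from_set_in_string(words):
    words = set(words)
    def words_in_string(s):
        # Pull out runs of letters so that "China," and "Russia." still match.
        return words.intersection(re.findall(r"[A-Za-z]+", s))
    return words_in_string

get_countries = find_words_from_set_in_string(["Argentina", "China", "Russia"])
found = get_countries(
    "The Embassy of China, in Argentina; near the Consulate of Russia."
)
```

Here `found` holds all three names even though each is followed directly by punctuation.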

【Comments】:

【Solution 3】:

Untested:

def removeCountry(name):
    for country in countries:
          name =  re.sub('(?<!of (the )?)' + country + '$', '', name).strip()

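With a negative lookbehind, re.sub only matches and replaces when country is not preceded by 'of ' or 'of the '.

【Comments】:

Nice idea, but it does not work: 'look-behind requires fixed-width pattern'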
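One way around the fixed-width restriction is to chain two negative lookbehinds, each of fixed width, instead of a single one with an optional part. The sketch below is not the answerer's code: the short `countries` list is only sample data, and the `re.escape` call and the final `return` are additions.

```python
import re

countries = ['China', 'Italy']  # sample data for illustration

def remove_country(name):
    for country in countries:
        # Two fixed-width negative lookbehinds chained together: the country
        # matches only if it is preceded by neither "of " nor "of the ".
        pattern = r'(?<!of )(?<!of the )' + re.escape(country) + '$'
        name = re.sub(pattern, '', name).strip()
    return name
```

This still loops over every country, so it addresses the lookbehind error rather than the speed concern in the question.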

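【Solution 4】:

The re.sub function accepts a function as the replacement; that function is called to obtain the text that should be substituted for a given match. So you could do: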

import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'

The result may contain some spurious whitespace (in the case above, the final strip() is needed). You could modify the regular expression to:

\s*(of(\sthe)?\s)?(?P<state>({}))

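to capture the whitespace before "of" or before the country name, avoiding spurious spaces in the output.

Note that this solution can handle a whole text, not just text of the form "Something of Country" and "Something Country". For example: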

In [38]: regex = make_regex(['China'])
    ...: text = '''This is more complex than just "Embassy of China" and "International China"'''

In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'

Another example usage:

In [33]: countries = [
    ...:     'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
    ...:     'France', 'Italy', 'Australia', 'New Zealand', 'Brazil', 
    ...:     'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
    ...:     'Spain', 'Portugal', 'Argentina', 'San Marino'
    ...: ]

In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'

In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)

In [36]: regex = make_regex(countries)
    ...: result = regex.sub(remove_name, text)

In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'

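【Comments】:

It seems this solution raises an error in Python 2.7: "sorry, but this version only supports 100 named groups"
@redrubia Fixed. However, I believe the error message is bogus. The regex always has 1 named group, but many unnamed groups.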
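If the limit is triggered by the total number of capturing groups rather than by named groups alone (an assumption, though the reply above points that way), one way to stay clear of it is to make every helper group non-capturing. The sketch below rewrites make_regex along those lines; the zero-padded `CountryNNN` names are invented test data.

```python
import re

def make_regex(countries):
    alternation = '|'.join(re.escape(c) for c in countries)
    # (?:...) groups do not capture, so the pattern has exactly one group
    # ("state") however many countries are listed.
    return re.compile(
        r'\s+(?:of(?:\sthe)?\s)?(?P<state>{})'.format(alternation)
    )

# Hypothetical large list, just to show the group count stays at 1.
regex = make_regex(['Country%03d' % i for i in range(500)])
group_count = regex.groups
match = regex.search('Embassy of the Country042')
```

The remove_name callback from the answer only uses match.group() and match.group('state'), so it should work unchanged with this pattern.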