【Question title】: Replace multiple (special) characters - most efficient way?
【Posted】: 2019-05-30 13:01:58
【Question】:

In a text I have, I want to replace each of the following special characters with a space:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

What is the most efficient way to do this, in terms of code execution time?

For example, I want this:

(Hello World)] *!

to become this:

Hello World

The candidate approaches seem to be:

  1. A list comprehension
  2. .replace()
  3. .translate()
  4. A regular expression

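【Comments】:

  • Please clarify. Do you want to replace each character with a space? Or do you want to remove each character entirely, replacing it with nothing? Because (Hello World)] *! does not become Hello World when you replace all of its special characters with spaces. It becomes [one space]Hello World[five spaces].
  • @Kevin, could you show both, or at least the latter?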
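
The difference Kevin describes can be checked directly. The following is a minimal sketch, assuming the symbol list from the question; the variable names (`to_space`, `to_nothing`, `collapsed`) are illustrative and not part of the original post. It shows replacement by spaces, outright deletion, and one way to get exactly `Hello World` by collapsing whitespace afterwards.

```python
symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"
text = "(Hello World)] *!"

# Map every symbol to a single space.
to_space = str.maketrans(dict.fromkeys(symbols, " "))
spaced = text.translate(to_space)

# Delete every symbol outright (third argument of maketrans).
to_nothing = str.maketrans("", "", symbols)
deleted = text.translate(to_nothing)

# Replace with spaces, then collapse runs of whitespace and trim both ends.
collapsed = " ".join(spaced.split())

print(repr(spaced))     # ' Hello World     '
print(repr(deleted))    # 'Hello World '
print(repr(collapsed))  # 'Hello World'
```

Note that even deletion leaves one trailing space, because the input itself contains a space between `]` and `*`.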

Tags: python python-3.x string


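【Solution 1】:

For an efficient solution you can use str.maketrans. Note that once the translation table is defined, all that remains is to map the characters of the string through it. You can do it like this: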

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+",
           "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

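First build a dictionary from the symbols with dict.fromkeys, setting a single space as the value of every entry, then create the translation table from that dictionary: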

d = dict.fromkeys(''.join(symbols), ' ')
# {'`': ' ', ',': ' ', '~': ' ', '!': ' ', '@': ' '...
t = str.maketrans(d)

Then call the string's translate method to map the characters in the dictionary above to a space:

s = '~this@is!a^test@'
s.translate(t)
# ' this is a test '

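【Comments】:

  • Thanks for your answer (upvoted). Can you explain "\\" and "\""? It is unclear to me what these remove. Evidently the way they are written has to do with the fact that the backslash is used for escaping.
  • They remove the characters \ and " respectively, but they need to be escaped because they have a special meaning. That is done by putting a backslash in front of them. @PoeteMaudit
  • You're welcome. If it solved the problem for you, don't forget that you can accept :) @PoeteMaudit
  • It certainly solved it for me, but some people below have done a full comparison, so I'm not sure which solution to accept :D
  • All variants of str.translate() will have roughly the same performance and will be much faster than the other alternatives. If you want to replace the symbols with spaces, you should accept this answer (or if you want to remove them completely :).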
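
The two escaped entries discussed above can be inspected directly. This is a small illustrative sketch (the variable names `backslash` and `quote` are assumptions made for the example): each literal is written with two characters in the source code but holds exactly one character at run time.

```python
backslash = "\\"   # escape sequence for one backslash character
quote = "\""       # escape sequence for one double-quote character

print(len(backslash), len(quote))  # 1 1
print(backslash, quote)            # \ "

# The quote can also be written without escaping by using single quotes.
assert quote == '"'
# A backslash has code point 92, a double quote has code point 34.
assert ord(backslash) == 92 and ord(quote) == 34
```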
【Solution 2】:

After running some tests, I can say that str.translate() is the best option.

Input data:

symbols = {"`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"}
translate_table = {126: None, 93: None, 91: None, 125: None, 92: None, 42: None, 45: None, 94: None, 62: None, 47: None, 35: None, 59: None, 44: None, 58: None, 60: None, 124: None, 61: None, 36: None, 95: None, 43: None, 96: None, 123: None, 64: None, 33: None, 38: None, 63: None, 46: None, 34: None, 41: None, 37: None, 40: None}
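# Raw string, so that \\ reaches the regex engine as an escaped backslash
# (in a normal string literal it collapses to a single \ that merely escapes the ":").
regular_expression = r"[`~!@#$%^&*()_\-+={[\]}|\\:;\"<,>.?/]"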
small_document = "Some**r@an]]\"dom t##xt"
normal_document = "TbsX^Kt$FZ%haZe+sLxu:Al\"xNAL\\Kix[mHp_gn]PrG`DqGd~GdNc;BoEq.SYD?Rp>ukq,UfO<XdTc=RUH}oifc&oP!CB*me@Qv{Qf-Li)gmXL/IQH#mne(Khaj|"
big_document = "QOfY+dymyoGBAxTAoIeM+jEWlaECUZEUXuMvprJOqFtQR*OiHtTFZkUNbYipSTTDPOVkIdGTcjWrQmbmthKBHBSEOZ)lQAIJOrVgmGGFdtqbuFfj<Dls<JWtKczAFMPYMemiJBJHdPeeul\\x>lGIBvUsxBokagvVovrrdxdKMtAKx>MEexYv>DGqPUXYaBQKwiSIUobrPQYjilhHMQunE;RiqOZPTnyOEgRrpxcuobvvmGkFpTqgMxYYhrmRRnauiqgvCmZ\"UauceaXsgAMSakxewzPrlIrYkVCVZaEGh]qiizYyzbkcHPF@qQsQMfHPDEbEnWtrCFoARUYAloOcctqmL@hegZbfhsHaJOxOxzQhZAVjVDgokosATfhKMT!WYyPWKcKAHKCzQGGJOCglYGZbftsuyntXZUKNqgGlsLJqgN,pUcOoA/tStXFXgpoSErgvw/OUMPWjJwt=bhMAIDayOZXJm=ifYYUuAvSIZjwnBfktNvEvZmvQso%HiNZEVqoDR%nQBtCkhjSfVfDuRSRsvp-sCunjDDUYSEVLICQdisxhEfqkUTkiPlLiUNNwrvO#WTDmweZyMeIbgNXkIsvaJeHYXV(HvRcGNZM(PPRIAyyLWivGiqMVBtwObqLfEEISyyjGNEdUU:ys`dXcVawkIEAjFXky`RUXNTm`LDM}mwTOcmsSo}haJXPnkwOhKLYwve}SWifzKq}grw}fMSQXXWguUQtlWpPZQymR^wBKEyolFlZnzEEmehSNenOqDOHWRit[Npm?R?DIPXAmQYYBbmJofxUzzWBsVCoPI?VmpXhoMxCfXyHEHowXzIJvExThiffLhBTtma_jk_NrbkPCGGypXvOuBqBxDYfC{bwIHoaqnJSKytxwWXBNnKG~PKuQklGblEwH~rJoGpKZmm~tTEFnPLdmzfrqJibMYIykzL$RZLPmsZjB$AAbZwFnByOydEOIfFvTaEQaSjbpeBZuUGY&ZfPQgLihmPYrhZxSwMzLrNF.WjFiDCLyXksdkLeMHVCfrdgCAotElQ|"
no_match_document = "XOtasggWqhtSLJpHEGoCmMRepFBlRfAGKTLPcEtKonFVsPgvWgAbvJVeMWILPgLapwAmTgXWVbxOJtUFmMygzIqYPqyAxzwElTFyYcGdtnNa"

Code:

import re  # needed by func3


def func1(doc):
    for c in symbols:
        doc = doc.replace(c, "")
    return doc


def func2(doc):
    return doc.translate(translate_table)


def func3(doc):
    return re.sub(regular_expression, "", doc)


def func4(doc):
    return "".join(c for c in doc if c not in symbols)

Test results (seconds for 100000 calls):

func1(small_document):      0.701037002
func1(normal_document):     1.1260866900000002
func1(big_document):        3.4234831459999997
func1(no_match_document):   0.7740780450000004

func2(small_document):      0.14135037500000003
func2(normal_document):     0.5368806810000004
func2(big_document):        0.8128472860000002
func2(no_match_document):   0.394245089

func3(small_document):      0.3157141610000007
func3(normal_document):     0.927359323000001
func3(big_document):        1.9310377590000005
func3(no_match_document):   0.18656399199999996

func4(small_document):      0.3034549070000008
func4(normal_document):     1.3695875739999988
func4(big_document):        10.115730064
func4(no_match_document):   1.2086623230000022

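UPD.

The input data I provided was "prepared" specifically so that the methods themselves could be tested in isolation.

To generate translate_table I used the following dict comprehension: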

translate_table = {ord(s): None for s in symbols}
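
As a side note, the same deletion table can be produced with the three-argument form of str.maketrans. The sketch below uses a shortened symbol set purely for illustration (the names `by_comprehension` and `by_maketrans` are not from the answer) and shows that both constructions yield the same dictionary of code points mapped to None.

```python
symbols = {"`", "~", "!", "@", "#"}  # shortened set, for illustration only

# Code point -> None means "delete this character".
by_comprehension = {ord(s): None for s in symbols}
by_maketrans = str.maketrans("", "", "".join(symbols))

print(by_comprehension == by_maketrans)  # True
print("a~b!c".translate(by_maketrans))   # abc
```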

Here is a link to a website for validating regular expressions (it may be helpful).


If you want to rerun the tests yourself, here is the code:

    if __name__ == '__main__':
        import timeit

        functions = ["func1", "func2", "func3", "func4"]
        documents = ["small_document", "normal_document", "big_document", "no_match_document"]

        # Time every function against every document, 100000 calls each.
        for func in functions:
            for doc in documents:
                stmt = "{}({})".format(func, doc)
                setup = "from __main__ import {}, {}".format(func, doc)
                print(stmt + ": ", timeit.timeit(stmt, setup=setup, number=100000))

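【Comments】:

  • Hey, thanks, I really wanted to see this kind of full comparison (upvote). However, your code does not make clear what the values of your variables are, e.g. translate_table.
  • What regular expression did you use? That determines the complexity, i.e. the time. For example, use re.sub('(?i)[^a-z ]+','',doc)
  • @PoeteMaudit, I added the input data to the post.
  • Also, in your translation table you already have the values as ASCII codes. Use the original values rather than the ASCII codes for the comparison.
  • @Onyambu Honestly, I haven't used either of them yet, because it is hard to come up with one that has no mistakes. That is the trouble with regular expressions; they are hard to learn and to write. You can take a look here: stackoverflow.com/questions/56376461/…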
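
One way to sidestep hand-written escaping, which the last comment identifies as error-prone, is to let re.escape build the character class from the symbol list. This is a sketch of that alternative, not the pattern used in the benchmark above; the name `pattern` is illustrative.

```python
import re

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+",
           "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

# re.escape adds the backslashes that "]", "\", "^" and "-" need inside [...].
pattern = re.compile("[" + re.escape("".join(symbols)) + "]")

print(pattern.sub("", "Some**r@an]]\"dom t##xt"))  # Somerandom txt
print(pattern.sub("", "back\\slash"))              # backslash
```

Compiling the pattern once and reusing it also avoids repeated pattern-cache lookups when the substitution runs many times.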
【Solution 3】:
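import timeit

# Raw triple-quoted string, so the backslash in `symbols` reaches the timed code unchanged.
s = r'''
def translate_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    t = str.maketrans(dict.fromkeys(symbols, ' '))
    # str.translate returns a new string, so return that result, not the unchanged input
    return s.translate(t)

def replace_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    for symbol in symbols:
        s = s.replace(symbol, ' ')
    return s
'''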

print(timeit.timeit('replace_()', setup=s, number=100000))
print(timeit.timeit('translate_()', setup=s, number=100000))

This prints:

0.7663131961598992

0.4139239452779293

So replacing with translate is almost 2x faster than using multiple replace calls.

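【Comments】:

  • Upvoted for showing timeit results. Note, however, that the results will vary with the length of the string and the number of characters that need replacing.
【Solution 4】:

My code replaces the symbols with spaces and does not remove those spaces afterwards.

For short strings .join() is fast, but for larger strings .translate() is faster when there is a lot to replace. Surprisingly, .replace() is still very fast when there is almost nothing to replace.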

text: '(Hello World)] *!'
using_replace                     0.046
using_join                        0.016
using_translate                   0.031

text: '~this@is!a^test@'
using_replace                     0.046
using_join                        0.017
using_translate                   0.029

text: '~/()&this@isasd!&=)(/as/dw&%#a^test@' * 1000
using_replace                     0.195
using_join                        2.327
using_translate                   0.061

text: 'a long text without chars to replace' * 1000
using_replace                     0.051
using_join                        2.100
using_translate                   0.064

The strategies being compared:

def using_replace(text, symbols_to_replace, replacement=' '):
    for char in symbols_to_replace:
        text = text.replace(char, replacement)

    return text

def using_join(text, symbols_to_replace, replacement=' '):
    return ''.join(
        replacement if char in symbols_to_replace else char
        for char in text)

def using_translate(text, symbols_to_replace, replacement=' '):
    translation_dict = str.maketrans(
        dict.fromkeys(symbols_to_replace, replacement))

    return text.translate(translation_dict)

This timeit code is run against the different texts:

    import timeit

    # a 'set' for faster lookup
    symbols = {
        '`', '~', '!', '@', '#', '$', '%', '^', '&', '*',
        '(', ')', '_', '-', '+', '=', '{', '[', ']', '}',
        '|', '/', ':', ';', '"', '<', ',', '>', '.', '?',
        '\\',
    }

    text_list = [
        '(Hello World)] *!',
        '~this@is!a^test@',
        '~/()&this@isasd!&=)(/as/dw&%#a^test@' * 1000,
        'a long text without chars to replace' * 1000,
    ]
    for s in text_list:
        assert (
                using_replace(s, symbols)
                == using_join(s, symbols)
                == using_translate(s, symbols))

    for s in text_list:
        print()
        print('text:', repr(s))
        for func in [using_replace, using_join, using_translate]:
            t = timeit.timeit(
                'func(s, symbols)',
                'from __main__ import func, s, symbols',
                number=10000)
            print('{:30s} {:8.3f}'.format(func.__name__, t))

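【Comments】:

  • Note: .translate() shows linear time (O(n)) in the length of the string.
  • .replace(), on the other hand, is fast when there are few replacements and slower when there are many.
  • Looks interesting and thorough, thanks (upvote). So can you say what the big-O computational complexity of these methods is?
  • .translate() is linear (but with a very small coefficient, less than 1) and depends on the length of the string and, to a lesser extent, on the size of the translation table.
  • .join() and .replace() are harder to pin down, but they are probably linear as well (with a coefficient of 1). They are also strongly affected by how many characters need replacing, so their behavior is quite variable rather than simple linear complexity.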
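
The linear-scaling claim for .translate() can be probed with a quick measurement. The following is a rough sketch rather than a rigorous benchmark: the helper `time_translate` and the chosen sizes are assumptions made for illustration, and absolute numbers will differ from machine to machine. If the time is roughly linear, each doubling of the input should roughly double the reported time.

```python
import timeit

symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"
table = str.maketrans(dict.fromkeys(symbols, " "))  # built once, outside the timing

def time_translate(text, number=200):
    # Best of three repeats, to reduce noise from other processes.
    return min(timeit.repeat(lambda: text.translate(table), number=number, repeat=3))

base = "~/()&this@isasd!&=)(/as/dw&%#a^test@"
timings = {n: time_translate(base * n) for n in (100, 200, 400, 800)}

for n, seconds in timings.items():
    print(f"{len(base) * n:>6} chars: {seconds:.5f} s")
```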
【Solution 5】:

str.translate() is indeed the fastest approach. Here is a concise way to build a translation table that removes the characters:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]
removeSymbols = str.maketrans("","","".join(symbols))

cleanText = "[Hello World] *!".translate(removeSymbols)
print(cleanText) # "Hello World "

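The maketrans() function can take 3 arguments: the first is a string of characters to replace, the second is a string of their replacements, and the third is a string of characters that should be deleted. To simply delete all of the characters, we only need to pass a string containing the symbols to remove as the third argument.

The translation table removeSymbols then removes the characters in the symbol list entirely.

To replace them with spaces instead, build the translation table like this: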

removeSymbols = str.maketrans("".join(symbols)," "*len(symbols))
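
For reference, the three calling forms of str.maketrans can be compared side by side. This is an illustrative sketch; the table names (`from_dict`, `to_spaces`, `mixed`) and sample strings are made up for the example.

```python
symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# One argument: a dict mapping each character to its replacement (or None to delete).
from_dict = str.maketrans({"-": " ", "_": " ", "!": None})

# Two arguments: equal-length strings, mapped position by position.
to_spaces = str.maketrans(symbols, " " * len(symbols))

# Three arguments: the third lists characters to delete.
mixed = str.maketrans("-_", "  ", "!?")

print("snake_case-name!".translate(from_dict))        # snake case name
print(repr("[Hello World] *!".translate(to_spaces)))  # ' Hello World    '
print("well-known_fact!?".translate(mixed))           # well known fact
```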

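【Comments】:

  • Hm, looks interesting, and it works (upvote). But I'd like to understand what it does. What is the point of the *3? Are you replacing these symbols with spaces or with nothing?
  • @Poete Maudit, that was actually more complicated than it needed to be. I didn't have to supply the symbol string 3 times. Using it only as the third argument of maketrans() is enough. I adjusted the example.
  • Yes, in that respect I actually think @yatu's answer above is the most concise.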
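【Solution 6】:

Although Roght's answer is the best one IMO and shows an objective approach, I want to point out that translate is not always the best! You really need to check for yourself, and the results will depend on your input.

A bit of complexity theory

(Disclaimer: I have not studied the Python source code, so the following is what I would expect.) Suppose we have K symbols to replace and N symbols in the source string:

str.replace should basically walk the whole string, check each symbol, and replace it if it matches the argument. That looks like plain O(N), so for K replacements it would be O(K*N).

translate, on the other hand, should walk the whole string only once, checking each symbol against the translation table. Since the translation table is a hash map, a lookup is O(1), so the whole translation does not depend on K at all and should be O(N).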
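
The two access patterns described above can be written out as a small model. This is a conceptual sketch in pure Python, not how CPython implements either method internally; the function names `replace_model` and `translate_model` are hypothetical. It makes the K passes versus one pass visible, while saying nothing about the constant factor of each pass, which is what a benchmark actually measures.

```python
def replace_model(text, mapping):
    """K passes: each str.replace call scans the whole string once."""
    passes = 0
    for old, new in mapping.items():
        text = text.replace(old, new)
        passes += 1
    return text, passes


def translate_model(text, mapping):
    """One pass: one dictionary lookup per character (O(1) on average)."""
    return "".join(mapping.get(ch, ch) for ch in text), 1


mapping = {"~": " ", "@": " ", "!": " ", "^": " "}
print(replace_model("~this@is!a^test@", mapping))    # (' this is a test ', 4)
print(translate_model("~this@is!a^test@", mapping))  # (' this is a test ', 1)
```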

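The question: why is replace faster in my case??? I don't know :(


I ran into this while refactoring our script that analyzes test logs (fairly large files, think 60 MB+). It cleans out some stray symbols and also does some HTML escaping; here is the replacement dictionary: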

replace_dict = {
        "&": "&amp;",
        "\"": "&quot;",
        "<": "&lt;",
        ">": "&gt;",
        "\u0000": "",
        "\u0007": "",
        "\u0008": "",
        "\u001a": "",
        "\u001b": "",
    }

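When I saw that the original code simply had 9 str.replace calls in a row, my first thought was "wtf, let's use translate instead", since that had to be much faster. In my case, however, I found that replace was actually the fastest method.

Test script: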

replace_dict = {
    "&": "&amp;",
    "\"": "&quot;",
    "<": "&lt;",
    ">": "&gt;",
    "\u0000": "",
    "\u0007": "",
    "\u0008": "",
    "\u001a": "",
    "\u001b": "",
}

symbols = list(replace_dict.keys())
translate_table = {ord(k): v if v else None for k, v in replace_dict.items()}
with open("myhuge.log") as f:
    big_document = f.read()


def func_replace(doc):
    for k, v in replace_dict.items():
        doc = doc.replace(k, v)
    return doc


def func_trans(doc):
    return doc.translate(translate_table)


def func_list_comp(doc):
    # That's not really equivalent to two methods above, but still good for perf comparison
    return "".join(c for c in doc if c not in symbols)


if __name__ == '__main__':
    import timeit
    number = 5
    print("func_replace(big_document): ", timeit.timeit("func_replace(big_document)",
          setup="from __main__ import func_replace, big_document", number=number))

    print("func_trans(big_document): ", timeit.timeit("func_trans(big_document)",
          setup="from __main__ import func_trans, big_document", number=number))

    print("func_list_comp(big_document): ", timeit.timeit("func_list_comp(big_document)",
          setup="from __main__ import func_list_comp, big_document", number=number))

The results:

func_replace(big_document): 4.945449151098728

func_trans(big_document): 15.22288554534316

func_list_comp(big_document): 45.01621600985527

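I can draw two conclusions:

  • The list comprehension is really slow; don't use it.
  • Counter-intuitively, in some cases replace can be several times faster than translate. If your replacement table is not too large and the string you are processing is very large, replace seems to do better.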
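
Before benchmarking the two approaches against each other, it is worth confirming that they produce the same output. The sketch below uses a shortened copy of the replacement dictionary and a made-up sample string; `with_replace` is an illustrative helper name. It relies on dict insertion order (guaranteed from Python 3.7), because the chained replace version is only correct when "&" is handled first.

```python
replace_dict = {
    "&": "&amp;",
    "\"": "&quot;",
    "<": "&lt;",
    ">": "&gt;",
    "\u0000": "",
    "\u0007": "",
}

# str.translate accepts strings of any length (including empty) as replacement values.
table = str.maketrans(replace_dict)


def with_replace(doc):
    # "&" is replaced first, so the "&" in "&lt;" etc. is not escaped a second time.
    for old, new in replace_dict.items():
        doc = doc.replace(old, new)
    return doc


sample = 'a<b & "c"\u0000'
print(sample.translate(table))  # a&lt;b &amp; &quot;c&quot;
assert sample.translate(table) == with_replace(sample)
```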

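【Comments】:

  • symbols = list(replace_dict.keys()) => symbols = set(replace_dict)
  • You could also try reading the file as binary ("rb") and using bytes.translate(), which may be faster.
  • @OlvinRoght I'm not sure how I could use bytes.translate in this case, since it seems to work only for single-character replacements, not for string replacements (for example, &quot; is a 6-character string). As for changing the \u.. characters: does it really make any difference? I find the \u format more consistent and readable...
  • I find the \xFF form more readable, but OK. The suggestion from the first comment still matters, though. Also, change "".join(...) to "".join([...]); that will speed up the last method slightly as well.
  • @OlvinRoght But that doesn't matter, does it? We are not optimizing the test program itself, only the actual replacement methods. This is common code that runs in the setup for all of the tested methods, so even if I added sleep(1) at the top of the file it would not change anything.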
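
To make the bytes.translate() discussion concrete: it maps single bytes only, but its `delete` argument can strip the control characters in one pass, leaving the multi-character HTML escapes to another step. The sketch below is an illustration under that assumption, not a measured improvement; `CONTROL_BYTES` and `strip_control_bytes` are hypothetical names.

```python
# Control bytes taken from the answer's replace_dict.
CONTROL_BYTES = b"\x00\x07\x08\x1a\x1b"


def strip_control_bytes(data: bytes) -> bytes:
    # table=None keeps every other byte unchanged; `delete` removes the listed ones.
    return data.translate(None, delete=CONTROL_BYTES)


raw = b"ok\x00 <tag>\x1b done\x07"
cleaned = strip_control_bytes(raw)
print(cleaned)  # b'ok <tag> done'

# Multi-character replacements still need another step, e.g. bytes.replace:
escaped = cleaned.replace(b"&", b"&amp;").replace(b"<", b"&lt;").replace(b">", b"&gt;")
print(escaped)  # b'ok &lt;tag&gt; done'
```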