【Question Title】: String substitution performance in Python
【Posted】: 2016-02-25 15:01:45
【Question】:

I have a list of about 50,000 strings (titles), and a list of about 150 words that should be removed from those titles wherever they occur. My code so far is below. The final output should be the list of 50,000 strings with every instance of the 150 words removed. I would like to know the most efficient (performance-wise) way of doing this. My code seems to work, but it is very slow.

excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    #remove all non-alphanumeric characters 
    s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

    #remove extra white space
    s = re.sub(r'\s+', ' ', s).strip()

    #lowercase
    s = s.lower()

    titles_alpha.append(s)
    #remove any excluded words


    for i in range (len(excludes)):
        titles_excl.append(titles_alpha[k].replace(excludes[i],''))

print titles_excl

【Comments】:

  • This looks wrong. For every item in excludes, you append titles_alpha[k].replace(...) to titles_excl once, which means titles_excl ends up with 50000*150 items, not 50000. I suggest testing your code on a smaller input - say, 10 titles and 3 excludes - to confirm it works as intended before running it on the big data.
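  • The fix the comment points at can be sketched like this (my sketch, not the asker's code): apply every exclude to the *same* string inside the inner loop, and append to the result exactly once per title.

```python
import re

def clean_titles(titles, excludes):
    """Return one cleaned string per title, with all excluded words removed."""
    cleaned = []
    for title in titles:
        # normalize: strip non-alphanumerics, collapse whitespace, lowercase
        s = re.sub(r'[^0-9a-zA-Z]+', ' ', title)
        s = re.sub(r'\s+', ' ', s).strip().lower()
        # remove every excluded word from the SAME string before appending
        for word in excludes:
            s = s.replace(word, '')
        cleaned.append(s)  # exactly one append per title
    return cleaned

print(clean_titles(["Hello, World!"], ["world"]))  # one result, not len(excludes)
```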

标签: python string performance list


【Solution 1】:

A lot of the performance overhead of regular expressions comes from compiling them. You should move the regex compilation out of the loop.

This should give you a considerable improvement:

pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    #remove all non-alphanumeric characters 
    s = re.sub(pattern1,' ',titles[k])

    #remove extra white space
    s = re.sub(pattern2,' ', s).strip()

Some tests using the wordlist.txt from here:

import re
def noncompiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

        #remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1=re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)



In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop

In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop

To remove the "bad words" from your excludes list, you should create a joined regex as @zsquare suggested; that is most likely the fastest you can get.

def with_excludes():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1=re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit","poo","ass","love","boo","ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)
        #remove bad words
        s = excludes_regex.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop

You can take this approach even further by compiling one master regex:

def master():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit","poo","ass","love","boo","ch"]
    nonalpha='[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex=re.compile('|'.join([nonalpha,whitespace,badwords]))

    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = master_regex.sub('',titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
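One behavioral note worth adding (my observation, not part of the original answer): because the master regex substitutes with an empty string, the whitespace matched by the non-alphanumeric alternative is deleted too, so adjacent words run together. A quick check:

```python
import re

excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
nonalpha = '[^0-9a-zA-Z]+'
whitespace = r'\s+'
badwords = '|'.join(excludes)
master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))

# spaces are matched by the nonalpha alternative and replaced with '',
# and "ch" is removed even inside "chocolate"
print(master_regex.sub('', 'I love chocolate cake!'))
```

If the run-together output is a problem, substitute with ' ' instead and collapse/strip afterwards, as in the two-pattern version above.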

You can squeeze out even more speed by avoiding explicit iteration in Python:

    result = [master_regex.sub('',item) for item in titles]


In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
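For completeness, the list_comp function timed above is not shown in full in the answer; presumably it just wraps that comprehension around a prebuilt master regex, something like this sketch:

```python
import re

def list_comp(titles, master_regex):
    # one comprehension; the per-title work happens in C inside .sub()
    return [master_regex.sub('', item) for item in titles]

# hypothetical exclude list, just for demonstration
pattern = re.compile(r'[^0-9a-zA-Z]+|\s+|' + '|'.join(["foo", "bar"]))
print(list_comp(["foo-bar baz!"], pattern))  # → ['baz']
```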

Note: the data-generation step on its own:

def baseline():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]

In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop

【Comments】:

    【Solution 2】:

    One way to do this is to dynamically create a regex of your excluded words and substitute them out in the list.

    Something like:

    excludes_regex = re.compile('|'.join(excludes))
    titles_excl = []
    for title in titles:
        titles_excl.append(excludes_regex.sub('', title))
    

    【Comments】:
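A caveat on the joined-regex idea in both answers (my addition): joining the words directly means short entries like "ch" also match inside longer words, and any regex metacharacter in an exclude word would corrupt the pattern. A hedged variant using re.escape and \b word boundaries:

```python
import re

excludes = ["ass", "ch"]
# escape each word so regex metacharacters are treated literally,
# and anchor with \b so "ass" does not match inside "pass"
excludes_regex = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in excludes) + r')\b')

print(excludes_regex.sub('', 'pass the chips ch'))  # "pass"/"chips" survive
```

Whether whole-word matching is the right behavior depends on what the 150-word list is meant to catch, so treat this as an option rather than a drop-in replacement.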
