正则表达式的很多性能开销来自于编译正则表达式。您应该将正则表达式的编译移出循环。
这应该会给您带来相当大的改进:
pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile('\s+')
for k in range(len(titles)):
#remove all non-alphanumeric characters
s = re.sub(pattern1,' ',titles[k])
#remove extra white space
s = re.sub(pattern2,' ', s).strip()
来自here 的wordlist.txt 的一些测试:
import re
def noncompiled():
with open("wordlist.txt",'r') as f:
titles = f.readlines()
titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
for k in range(len(titles)):
#remove all non-alphanumeric characters
s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])
#remove extra white space
s = re.sub( '\s+', ' ', s).strip()
def compiled():
with open("wordlist.txt",'r') as f:
titles = f.readlines()
titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
pattern1=re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile( '\s+')
for k in range(len(titles)):
#remove all non-alphanumeric characters
s = pattern1.sub('',titles[k])
#remove extra white space
s = pattern2.sub('', s)
In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop
In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop
要从排除列表中删除“坏词”,您应该按照@zsquare 的建议创建一个加入的正则表达式,这很可能是您可以获得的最快的。
def with_excludes():
with open("wordlist.txt",'r') as f:
titles = f.readlines()
titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
pattern1=re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile( '\s+')
excludes = ["shit","poo","ass","love","boo","ch"]
excludes_regex = re.compile('|'.join(excludes))
for k in range(len(titles)):
#remove all non-alphanumeric characters
s = pattern1.sub('',titles[k])
#remove extra white space
s = pattern2.sub('', s)
#remove bad words
s = pattern2.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop
您可以通过编译一个主正则表达式来进一步采用这种方法:
def master():
with open("wordlist.txt",'r') as f:
titles = f.readlines()
titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
excludes = ["shit","poo","ass","love","boo","ch"]
nonalpha='[^0-9a-zA-Z]+'
whitespace='\s+'
badwords = '|'.join(excludes)
master_regex=re.compile('|'.join([nonalpha,whitespace,badwords]))
for k in range(len(titles)):
#remove all non-alphanumeric characters
s = master_regex.sub('',titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
您可以通过避免在 python 中的迭代来获得更快的速度:
result = [master_regex.sub('',item) for item in titles]
In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
注意:数据生成步骤:
def baseline():
with open("wordlist.txt",'r') as f:
titles = f.readlines()
titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop