使用带有 "\s" 的正则表达式并执行简单的 string.split() 将也删除其他空格 - 例如换行符、回车符、制表符。除非需要这样做,否则仅多个空格,我将提供这些示例。
我使用11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum 进行实际时间测试,并在整个过程中使用了随机长度的额外空格:
original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
单行基本上会删除任何前导/尾随空格,并保留前导/尾随空格(但只有 ONE ;-)。
# setup = '''
import re
def while_replace(string):
while ' ' in string:
string = string.replace(' ', ' ')
return string
def re_replace(string):
return re.sub(r' {2,}' , ' ', string)
def proper_join(string):
split_string = string.split(' ')
# To account for leading/trailing spaces that would simply be removed
beg = ' ' if not split_string[ 0] else ''
end = ' ' if not split_string[-1] else ''
# versus simply ' '.join(item for item in string.split(' ') if item)
return beg + ' '.join(item for item in split_string if item) + end
original_string = """Lorem ipsum ... no, really, it kept going... malesuada enim feugiat. Integer imperdiet erat."""
assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)
#'''
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string
# re_replace_test
new_string = original_string[:]
new_string = re_replace(new_string)
assert new_string != original_string
# proper_join_test
new_string = original_string[:]
new_string = proper_join(new_string)
assert new_string != original_string
注意: “while 版本”复制了original_string,因为我相信一旦在第一次运行时修改,连续运行会更快(如果只是一点点)。由于这增加了时间,我将此字符串副本添加到其他两个中,以便时间仅在逻辑上显示差异。 Keep in mind that the main stmt on timeit instances will only be executed once;我这样做的原始方式,while 循环在同一标签上工作,original_string,因此第二次运行,将无事可做。它现在的设置方式,调用一个函数,使用两个不同的标签,这不是问题。我已经向所有工作人员添加了assert 语句,以验证我们每次迭代都会更改某些内容(对于那些可能持怀疑态度的人)。例如,更改为 this 并且它会中断:
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string # will break the 2nd iteration
while ' ' in original_string:
original_string = original_string.replace(' ', ' ')
Tests run on a laptop with an i5 processor running Windows 7 (64-bit).
timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)
test_string = 'The fox jumped over\n\t the log.' # trivial
Python 2.7.3, 32-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001066 | 0.001260 | 0.001128 | 0.001092
re_replace_test | 0.003074 | 0.003941 | 0.003357 | 0.003349
proper_join_test | 0.002783 | 0.004829 | 0.003554 | 0.003035
Python 2.7.3, 64-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001025 | 0.001079 | 0.001052 | 0.001051
re_replace_test | 0.003213 | 0.004512 | 0.003656 | 0.003504
proper_join_test | 0.002760 | 0.006361 | 0.004626 | 0.004600
Python 3.2.3, 32-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001350 | 0.002302 | 0.001639 | 0.001357
re_replace_test | 0.006797 | 0.008107 | 0.007319 | 0.007440
proper_join_test | 0.002863 | 0.003356 | 0.003026 | 0.002975
Python 3.3.3, 64-bit, Windows
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.001444 | 0.001490 | 0.001460 | 0.001459
re_replace_test | 0.011771 | 0.012598 | 0.012082 | 0.011910
proper_join_test | 0.003741 | 0.005933 | 0.004341 | 0.004009
test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"
Python 2.7.3, 32-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.342602 | 0.387803 | 0.359319 | 0.356284
re_replace_test | 0.337571 | 0.359821 | 0.348876 | 0.348006
proper_join_test | 0.381654 | 0.395349 | 0.388304 | 0.388193
Python 2.7.3, 64-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.227471 | 0.268340 | 0.240884 | 0.236776
re_replace_test | 0.301516 | 0.325730 | 0.308626 | 0.307852
proper_join_test | 0.358766 | 0.383736 | 0.370958 | 0.371866
Python 3.2.3, 32-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.438480 | 0.463380 | 0.447953 | 0.446646
re_replace_test | 0.463729 | 0.490947 | 0.472496 | 0.468778
proper_join_test | 0.397022 | 0.427817 | 0.406612 | 0.402053
Python 3.3.3, 64-bit
test | minum | maximum | average | median
---------------------+------------+------------+------------+-----------
while_replace_test | 0.284495 | 0.294025 | 0.288735 | 0.289153
re_replace_test | 0.501351 | 0.525673 | 0.511347 | 0.508467
proper_join_test | 0.422011 | 0.448736 | 0.436196 | 0.440318
对于琐碎的字符串,while 循环似乎是最快的,其次是 Pythonic 字符串拆分/连接,而正则表达式拉到后面。
对于非平凡的字符串,似乎还有更多需要考虑的地方。 32位2.7?这是救援的正则表达式! 2.7 64 位? while 循环是最好的,并且有相当大的优势。 32 位 3.2,使用“正确的”join。 64 位 3.3,使用 while 循环。再次。
最后,如果/在哪里/在需要时可以提高性能,但最好是remember the mantra:
- 让它发挥作用
- 改正
- 加快速度
IANAL、YMMV、自告人!