re can also be used here, by splitting on a capturing group:
>>> import re
>>> s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
>>> sep = '7'
>>>
>>> [i for i in re.split(f'({sep}[^{sep}]*)', s) if i]
['The ', '7 quick brown foxes jumped ', '7 times over ', '7 lazy dogs']
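If the f-string is hard to read, note that it evaluates to (7[^7]*).
(Instead of the list comprehension, you could use list(filter(bool, ...)), but it is uglier.)
In Python 3.7 and later, re.split() allows splitting on zero-width patterns. This means a lookahead regex, namely f'(?={sep})', can be used instead of the group shown above.
The timings are the odd part: with a plain re.split() call (that is, without a compiled pattern object), the group solution consistently runs about 1.5 times faster than the lookahead. Once the patterns are compiled, however, the lookahead beats the group: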
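A minimal sketch of the lookahead variant, assuming Python 3.7 or later. The helper name split_keep_lookahead is hypothetical, and the re.escape() call is an addition not present in the answer, included so that separators containing regex metacharacters (such as '.') are matched literally:

```python
import re

def split_keep_lookahead(s, sep):
    # Hypothetical helper: split just *before* each occurrence of sep, so
    # every piece after the first one starts with the separator.
    # Requires Python 3.7+, where re.split() accepts zero-width matches.
    pattern = f'(?={re.escape(sep)})'
    # Drop the empty string that appears when s starts with sep.
    return [piece for piece in re.split(pattern, s) if piece]

s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
print(split_keep_lookahead(s, '7'))
```

Unlike the [^{sep}] character class in the group pattern, the lookahead does not assume a single-character separator.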
In [4]: r_lookahead = re.compile(f'(?={sep})')
In [5]: r_group = re.compile(f'({sep}[^{sep}]*)')
In [6]: %timeit [i for i in r_lookahead.split(s) if i]
2.76 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit [i for i in r_group.split(s) if i]
5.74 µs ± 65.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit [i for i in r_lookahead.split(s * 512) if i]
137 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [9]: %timeit [i for i in r_group.split(s * 512) if i]
1.88 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
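Outside IPython, a comparison along these lines could be reproduced with the standard timeit module. This is only a rough sketch: the wrapper functions with_lookahead and with_group are hypothetical names, the loop count is arbitrary, and the absolute numbers will differ by machine and Python version.

```python
import re
import timeit

s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
sep = '7'

# Compile both patterns once, as in the IPython session.
r_lookahead = re.compile(f'(?={sep})')       # zero-width split, Python 3.7+
r_group = re.compile(f'({sep}[^{sep}]*)')    # capturing-group split

def with_lookahead(text):
    return [i for i in r_lookahead.split(text) if i]

def with_group(text):
    return [i for i in r_group.split(text) if i]

# Check that both approaches produce the same pieces before timing them.
assert with_lookahead(s) == with_group(s)

for label, func in (('lookahead', with_lookahead), ('group', with_group)):
    # Total seconds for 10,000 calls, including the small wrapper overhead.
    elapsed = timeit.timeit(lambda: func(s), number=10_000)
    print(f'{label}: {elapsed:.4f} s')
```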
A recursive solution also works fine, although it is slower than splitting on a compiled regex (but faster than calling re.split(...) directly):
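def splitkeep(s, sep, prefix=''):
    # Split off the text before the first sep; delim is '' if sep is absent
    start, delim, end = s.partition(sep)
    # Recurse on the remainder, carrying delim along as the next piece's prefix
    return [prefix + start, *(end and splitkeep(end, sep, delim))]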
>>> s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
>>>
>>> splitkeep(s, '7')
['The ', '7 quick brown foxes jumped ', '7 times over ', '7 lazy dogs']
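One caveat about the recursive version: it uses one stack frame per separator, so an input with more separators than Python's recursion limit (1000 by default), such as s * 512, raises RecursionError. It also drops a separator that sits at the very end of the string, because end is then empty. A loop-based sketch built on the same str.partition() idea avoids both; the name splitkeep_iter is hypothetical and this is not part of the answer above:

```python
def splitkeep_iter(s, sep):
    # Hypothetical iterative counterpart of splitkeep(); sep must be non-empty.
    pieces = []
    prefix = ''
    while True:
        start, delim, rest = s.partition(sep)
        pieces.append(prefix + start)
        if not delim:
            # No separator was found, so the last piece has been emitted.
            return pieces
        # The separator just found becomes the prefix of the next piece.
        prefix, s = delim, rest

s = 'The 7 quick brown foxes jumped 7 times over 7 lazy dogs'
print(splitkeep_iter(s, '7'))
```

Like splitkeep(), this keeps an empty first piece when the string starts with the separator; filter the result if that is unwanted.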