As I see it, your problem is not re-reading the file but matching slices of a long list against a short list. As the other answers have pointed out, you can speed your program up with a plain list or a memory-mapped file.
If you want a specific data structure to speed things up further, I suggest looking at blist, in particular because it slices lists faster than the standard Python list: the documentation claims O(log n) instead of O(n).
On a list of ~10 MB I measured a speedup of nearly 4x:
import random

from blist import blist

LINE_NUMBER = 1000000


def write_files(line_length=LINE_NUMBER):
    # Haystack: one line per entry.
    with open('haystack.txt', 'w') as infile:
        for _ in range(line_length):
            infile.write('an example\n')

    # Needles: "needle<TAB>start<TAB>end", one per line.
    # Note the integer division (//) so this also runs on Python 3.
    with open('needles.txt', 'w') as infile:
        for _ in range(line_length // 100):
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            infile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))


def read_files():
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())
    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list


def match_over(list_structure):
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1

    return float(matches) / float(total)
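For context on why slicing dominates here: a plain-list slice copies the selected range, so each `needle in list_structure[start:end]` test pays O(end - start) in time and memory just to build the copy before the membership scan even starts. A small standard-library-only sketch (the `in_range` helper is hypothetical, not part of the code above) of an equivalent membership test that avoids materializing the copy:

```python
def in_range(lst, needle, start, end):
    # Hypothetical helper: membership test over an index range
    # without building the slice copy that lst[start:end] would create.
    return any(lst[i] == needle for i in range(start, end))

haystack = ['an example'] * 1000
haystack[500] = 'a sample'

# Same answer as the slice-based test, without the intermediate copy.
assert in_range(haystack, 'a sample', 0, 600) == ('a sample' in haystack[0:600])
```

This trades the slice copy for pure Python iteration, so it is not necessarily faster in practice; blist attacks the same cost at the data-structure level instead.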
Measured with IPython's %time command, blist takes 12 seconds where the plain list takes 46:
In [1]: import main
In [3]: main.write_files()
In [4]: !ls -lh *.txt
10M haystack.txt
233K needles.txt
In [5]: normal_list, enhanced_list = main.read_files()
In [8]: %time main.match_over(normal_list)
CPU times: user 44.9 s, sys: 1.47 s, total: 46.4 s
Wall time: 46.4 s
Out[8]: 0.005032
In [9]: %time main.match_over(enhanced_list)
CPU times: user 12.6 s, sys: 33.7 ms, total: 12.6 s
Wall time: 12.6 s
Out[9]: 0.005032