Parallel Computing for Large Text Files: Answers

【Question Title】: Parallel Computing for Large Text Files
【Posted】: 2019-12-04 22:11:40
【Question】:

I am trying to find and correct some typos in a very large text file. Basically, I run this code:

import re

# read the whole OCR output into memory
with open("text.txt") as ocr:
    text = ocr.readlines()
clean_text = []
for line in text:
    last = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
    clean_text.append(last)
# write the corrected lines with Unix line endings
with open("new_text.txt", "w", newline="\n") as new_text:
    for line in clean_text:
        new_text.write(line)

In practice I call 're.sub' more than 1500 times, and 'text.txt' has 100,000 lines. Can I split my text into parts and use a different core for each part?

【Comments】:

I don't know how Python handles re, but in general it is best to call re.compile() once and then call re.execute() repeatedly.
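A minimal sketch of what that comment suggests, in Python terms. Python's re module has no `execute`; the closest counterpart here is the `sub` method of a compiled pattern. The `re` module does cache recently compiled patterns internally, but with more than 1500 distinct rules that cache may not hold them all, so compiling each rule once up front is a reasonable precaution. The `RULES` list and `clean_line` helper are illustrative names, and the second rule is a made-up placeholder, not one of the asker's rules.

```python
import re

# Each rule is (compiled pattern, replacement). The first is copied from the
# question; the second is a hypothetical placeholder for the remaining rules.
RULES = [
    (re.compile(r"^(\|)([0-9])(\s)([A-Z][a-z]+[a-z])\,"), "1\\2\t\\3\\4,"),
    (re.compile(r"\bteh\b"), "the"),  # hypothetical rule, not from the question
]


def clean_line(line):
    """Run every precompiled rule over one line, in list order."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line
```

Whether this alone gives a noticeable speedup depends on how many rules there are and how they are written, so it is worth timing on a sample of the real file.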

Tags: python python-3.x text parallel-processing

【Solution 1】:

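This applies the text-processing function (currently the re.sub from your question) to NUM_CORES roughly equal-sized chunks of the input text file, then writes the results out in the same order as the original input file.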

from multiprocessing import Pool, cpu_count
import numpy as np
import re

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# the guard keeps worker processes from re-running this block when they import the module
if __name__ == "__main__":
    # read in data and convert to sequence of equally-sized chunks
    with open('data/text.txt', 'r') as f:
        lines = f.readlines()

    text_chunks = np.array_split(lines, NUM_CORES)

    # process each chunk in parallel; pool.map returns results in input order
    with Pool(NUM_CORES) as pool:
        results = pool.map(process_text, text_chunks)

    # write out results
    with open("new_text.txt", "w", newline="\n") as f:
        for text_chunk in results:
            f.write(text_chunk)
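For readers who would rather not pull in NumPy just to split a list, the same chunk-and-map idea can be sketched with plain slicing. This is a variation on the answer above, not the answerer's code: `split_into_chunks`, `process_chunk`, and `clean_lines_parallel` are names made up for this illustration, and the pattern is the single rule from the question, compiled once at import time so each worker reuses it.

```python
import re
from multiprocessing import Pool, cpu_count

# Pattern and replacement copied from the question; compiled once per process.
PATTERN = re.compile(r"^(\|)([0-9])(\s)([A-Z][a-z]+[a-z])\,")
REPLACEMENT = "1\\2\t\\3\\4,"


def split_into_chunks(lines, num_chunks):
    """Split lines into num_chunks contiguous slices whose lengths differ by at most one."""
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # the first `extra` chunks take one additional line each
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Clean every line in one chunk and return the cleaned lines."""
    return [PATTERN.sub(REPLACEMENT, line) for line in chunk]


def clean_lines_parallel(lines, num_workers=None):
    """Clean lines across worker processes, keeping the original line order."""
    num_workers = num_workers or cpu_count()
    with Pool(num_workers) as pool:
        # pool.map returns one result per chunk, in the order the chunks were passed
        results = pool.map(process_chunk, split_into_chunks(lines, num_workers))
    return [line for chunk in results for line in chunk]
```

As with the answer's code, `clean_lines_parallel` should be called from under an `if __name__ == "__main__":` guard, since worker processes may import the main module when they start.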

【Comments】:

Thanks, it works. Before using this code the program took 13 minutes. Now it takes 2 minutes.
Great, glad to hear it!