Multiprocessing to compare strings in multiple .txt files? Answers

【Question title】: Multiprocess to compare strings in multi .txt files?
【Posted】: 2021-07-06 18:39:39
【Question】:

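I have several txt files, each with about a million lines, and searching one for a match takes about a minute. For convenience the files are saved as 0.txt, 1.txt, 2.txt, ... in_1 and searchType are inputs given by the user.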

class ResearchManager():
    def __init__(self, searchType, in_1, file):
        self.file = file
        self.searchType = searchType
        self.in_1 = in_1

    def Search(self):
        current_db = open(str(self.file) + ".txt", 'r')
        ...  # (rest of the method elided in the original post)

        # Current file processing


if __name__ == '__main__':

    n_file = 35
    for number in range(n_file):
        # input_n and input_1 are the user-supplied inputs
        RM = ResearchManager(input_n, input_1, number)
        RM.Search()

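I would like to speed up the search with multiprocessing, but I have not managed to get it working. Is there a way to do this? Thanks.

Edit.

I was able to use threads this way: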

class ResearchManager(threading.Thread):
    def __init__(self, searchType, in_1, file):
        threading.Thread.__init__(self)
        self.file = file
        self.searchType = searchType
        self.in_1 = in_1

    def run(self):
        current_db = open(str(self.file) + ".txt", 'r')
        ...  # (rest of the method elided in the original post)

        # Current file processing

...

        threads=[]
        for number in range(n_file+1):
            
            threads.append(ResearchManager(input_n,input_1,number))

        start=time.time()
        
        for t in threads:
            t.start()
            
        for t in threads:
            t.join()
        end=time.time()

However, the total execution time is a few seconds longer than with the plain for loop.

【Comments】:

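You could implement the code with ThreadPoolExecutor first and switch to ProcessPoolExecutor later. If any errors show up during the switch, they are most likely caused by pickling and call for some refactoring. Make sure that the tasks and arguments submitted to ProcessPoolExecutor are all picklable: avoid file objects, lambdas/nested functions, and so on.
I was trying to adapt what is said here. Thanks for the suggestion, I'll take a look.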
multiprocessing.pool.ThreadPool (also reachable as multiprocessing.dummy.Pool) is a thread-based drop-in replacement for multiprocessing.Pool.
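
To make the executor suggestion above concrete, here is a minimal sketch of what the switch could look like. It is not the asker's code: count_matches and search_files are hypothetical names, and a plain substring test stands in for the elided search logic. The worker is a module-level function that receives only a path string and the search string, so everything crossing the process boundary is picklable, and the file is opened inside the worker rather than passed in.

```python
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path


def count_matches(path, needle):
    """Return (file name, number of lines that contain needle).

    Hypothetical stand-in for the search logic elided in the question.
    """
    # The file is opened here, inside the worker, so only two strings
    # (path and needle) have to be pickled and sent to the worker process.
    with open(path, encoding="utf-8") as handle:
        return Path(path).name, sum(1 for line in handle if needle in line)


def search_files(paths, needle, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Search every file in paths with one task per file.

    executor_cls can be ThreadPoolExecutor while developing and
    ProcessPoolExecutor afterwards; the rest of the code stays unchanged.
    """
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(count_matches, str(p), needle) for p in paths]
        # Results are collected in submission order.
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    # Small throwaway files named 0.txt, 1.txt, ... as in the question.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for number in range(4):
            path = Path(tmp) / f"{number}.txt"
            path.write_text("alpha\nbeta\nalpha beta\n" * (number + 1), encoding="utf-8")
            paths.append(path)

        print(search_files(paths, "alpha", executor_cls=ThreadPoolExecutor))
        print(search_files(paths, "alpha", executor_cls=ProcessPoolExecutor))
```

The `if __name__ == "__main__":` guard matters here: with the spawn or forkserver start methods the worker processes import the main module again, and unguarded code that creates a pool would run in every worker. Whether processes actually beat the plain loop depends on how much of the minute per file is spent in Python code rather than waiting on the disk, so it is worth measuring on the real files.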

Tags: python performance for-loop multiprocessing

【Solution 1】:

Can you show what you tried with threads? Take a look at this article; it gives a good basic understanding of how Python threads work.

https://realpython.com/intro-to-python-threading/

import logging
import threading
import time

def thread_function(name):
    logging.info("Thread %s: starting", name)
    time.sleep(2)
    logging.info("Thread %s: finishing", name)

if __name__ == "__main__":
    format = "%(asctime)s: %(message)s"
    logging.basicConfig(format=format, level=logging.INFO,
                        datefmt="%H:%M:%S")

    threads = list()
    for index in range(3):
        logging.info("Main    : create and start thread %d.", index)
        x = threading.Thread(target=thread_function, args=(index,))
        threads.append(x)
        x.start()

    for index, thread in enumerate(threads):
        logging.info("Main    : before joining thread %d.", index)
        thread.join()
        logging.info("Main    : thread %d done", index)

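【Comments】:

The GIL will prevent any real performance gain from threads.
I'm new to this too and still learning. Would you mind elaborating on how it hinders the performance gain?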
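
As a rough illustration of the GIL point raised above: in a standard CPython build, the global interpreter lock lets only one thread execute Python bytecode at a time, so threads help when the work mostly waits on I/O but not when it is CPU-bound Python code such as comparing strings line by line. The sketch below uses a toy CPU-bound task (count_primes and timed_map are made-up names, not from the question) and runs it once with a thread pool and once with a process pool. Exact timings depend on the machine and Python build, so none are quoted here.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(executor_cls, limits, workers=4):
    """Run count_primes on each limit; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [100_000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, elapsed = timed_map(executor_cls, limits)
        print(f"{executor_cls.__name__}: {elapsed:.2f}s results={results}")
```

On a multi-core machine with a standard CPython build, the thread pool run would be expected to take about as long as doing the four tasks one after another, while the process pool run can spread them across cores. That matches the question's observation that the threaded version was no faster than the plain for loop.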