Using multiprocessing for file reading in Python 3

【Question title】: Using multiprocessing for file reading in Python 3
【Posted】: 2019-04-29 14:26:17
【Question】:

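I have very large files, each almost 2 GB, so I would like to process several of them in parallel. That should be possible because all the files share a similar format, so each one can be read independently of the others. I know I should use the multiprocessing library, but I'm really confused about how to use it with my code.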

My file-reading code is:

def file_reading(file,num_of_sample,segsites,positions,snp_matrix):
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)

My main function is:

if __name__ == "__main__":    
    segsites = []
    positions = []
    snp_matrix = []




    path_to_directory = '/dataset/example/'
    extension = '*.msOut'

    num_of_samples = 162
    filename = glob.glob(path_to_directory+extension)

    ###How can I use multiprocessing with function file_reading
    number_of_workers = 10

    x,y,z = [],[],[]

    array_of_number_tuple = [(filename[file], segsites,positions,snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        pos,snp = p.map(file_reading,array_of_number_tuple)
        x.extend(pos)
        y.extend(snp)

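So the inputs to my function are as follows:

file - a list containing the file names
num_of_samples - an integer value
segsites - initially an empty list that I want to append to while reading the file.
positions - initially an empty list that I want to append to while reading the file.
snp_matrix - initially an empty list that I want to append to while reading the file.

At the end, the function returns the positions list and the snp_matrix list. How can I use multiprocessing when my arguments are lists and an integer? The way I am using multiprocessing gives me the following error:

TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'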

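【Comments】:

For the future, this may be helpful to read: meta.stackoverflow.com/questions/290746/… (I think the question has now changed a lot compared to the original one; even if you meant to ask the same thing, what is written is somewhat different). So I deleted my answer, because it no longer makes sense...
Also, you should read stackoverflow.com/help/mcve and try to bring your question close to what is described there.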

Tags: python python-3.x python-multiprocessing

【Solution 1】:

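The elements of the list passed to Pool.map are not unpacked automatically. Pool.map calls the function once per list element and passes that element as a single argument, so 'file_reading' can generally take only one parameter.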
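To make this concrete, here is a minimal sketch of what a worker function receives from Pool.map. The function name `describe` and the file names are hypothetical and only serve the illustration; they are not part of the question's code.

```python
import multiprocessing

def describe(arg):
    # Pool.map calls describe(item) once per list element,
    # so a tuple arrives whole as the single argument.
    return type(arg).__name__, len(arg)

if __name__ == "__main__":
    # Hypothetical (file name, sample count) pairs, for illustration only.
    items = [("a.msOut", 162), ("b.msOut", 162)]
    with multiprocessing.Pool(2) as pool:
        print(pool.map(describe, items))  # prints [('tuple', 2), ('tuple', 2)]
```

Each call gets one two-element tuple rather than two separate arguments, which is why a function with several required parameters fails with a TypeError.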

Of course, that parameter can be a tuple, so unpacking it yourself is no problem:

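import glob
import multiprocessing

def file_reading(args):
    # Pool.map passes each list element as one argument, so unpack the tuple here.
    file, num_of_sample, segsites, positions, snp_matrix = args
    with open(file, buffering=2000009999) as f:
        ### I read file here. I am not putting that code here.
        if len(snp_matrix) != len(positions):
            # Raise rather than sys.exit(): an exception is re-raised by p.map in
            # the parent, whereas a worker that exits can leave the pool waiting.
            raise ValueError('length of snp matrix and length of position vector not the same.')
        return positions, snp_matrix  ## return statement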

if __name__ == "__main__":    
    segsites = []
    positions = []
    snp_matrix = []

    path_to_directory = '/dataset/example/'
    extension = '*.msOut'

    num_of_samples = 162
    filename = glob.glob(path_to_directory+extension)

    number_of_workers = 10

    x,y,z = [],[],[]

    array_of_number_tuple = [(name, num_of_samples, segsites, positions, snp_matrix) for name in filename]
    with multiprocessing.Pool(number_of_workers) as p:
        # p.map returns a list with one (positions, snp_matrix) tuple per file.
        results = p.map(file_reading, array_of_number_tuple)
    for pos, snp in results:
        x.extend(pos)
        y.extend(snp)
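
If you would rather keep a multi-parameter signature, Pool.starmap (available since Python 3.3) unpacks each tuple into positional arguments for you. The sketch below is only an illustration under assumptions: `read_one`, `read_all`, and the temporary sample files are hypothetical stand-ins, not the question's parser. It also builds the lists inside the worker and returns them, because arguments are pickled and copied into each worker process, so appending to a list passed in from the parent does not change the parent's copy.

```python
import multiprocessing
import os
import tempfile

def read_one(path, num_of_samples):
    # Hypothetical stand-in for the real parser: build fresh lists in the
    # worker and return them, since arguments are copied into each process.
    positions, snp_matrix = [], []
    with open(path) as f:
        for line_number, line in enumerate(f):
            positions.append(line_number)
            snp_matrix.append(line.strip()[:num_of_samples])
    return positions, snp_matrix

def read_all(paths, num_of_samples, workers=2):
    # starmap unpacks each (path, num_of_samples) tuple into two arguments.
    jobs = [(path, num_of_samples) for path in paths]
    with multiprocessing.Pool(workers) as pool:
        results = pool.starmap(read_one, jobs)
    # Merge the per-file results in the parent process.
    all_positions, all_snps = [], []
    for positions, snp_matrix in results:
        all_positions.extend(positions)
        all_snps.extend(snp_matrix)
    return all_positions, all_snps

if __name__ == "__main__":
    # Write a few tiny sample files so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"sample{i}.msOut")
            with open(path, "w") as f:
                f.write("0101\n1100\n")
            paths.append(path)
        print(read_all(paths, num_of_samples=4))
```

Whether map with manual unpacking or starmap reads better is mostly a matter of taste; both return the results in the same order as the input list.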

【Comments】: