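Memory Error with Multiprocessing in Python: Answers

【Question title】: Memory Error with Multiprocessing in Python
【Posted】: 2014-10-15 06:59:43
【Question】:

I am trying to run some expensive scientific computations in Python. I have to read a lot of data stored in CSV files and then process it. Since each task takes a long time and I have about 8 processors available, I tried using Pool from the multiprocessing module.

This is how I structured the multiprocessing call: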

    pool = Pool()
    vector_components = []
    for sample in range(samples):
        vector_field_x_i = vector_field_samples_x[sample]
        vector_field_y_i = vector_field_samples_y[sample]
        vector_component = pool.apply_async(vector_field_decomposer, args=(x_dim, y_dim, x_steps, y_steps,
                                                                           vector_field_x_i, vector_field_y_i))
        vector_components.append(vector_component)
    pool.close()
    pool.join()

    vector_components = map(lambda k: k.get(), vector_components)

    for vector_component in vector_components:
        CsvH.write_vector_field(vector_component, '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')

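I was running a data set of 500 samples, each of size 100 (x_dim) x 100 (y_dim).

Up to that point everything worked fine.

Then I received a data set of 500 samples of size 400 x 400.

When I run it, I get an error when calling get.

I also tried running a single 400 x 400 sample and got the same error.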

Traceback (most recent call last):
  File "__init__.py", line 33, in <module>
    VfD.samples_vector_field_decomposer(samples, x_dim, y_dim, x_steps, y_steps, vector_field_samples_x, vector_field_samples_y)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in samples_vector_field_decomposer
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in <lambda>
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/.pyenv/versions/2.7.5/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
MemoryError

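What should I do?

Thank you in advance.

【Comments】:

Are you running out of memory?
It seems so, now that you mention it (I am running this over ssh).
What do you want to do with the contents of vector_components once it is fully populated? Your samples now seem too large to fit in memory, so you can only keep part of them in memory at a time.
Then you have only 3 options: use a smaller data set, split your data into chunks and process them independently, or get more memory.
I want to save each of them (in another csv) as they are processed. Can I do that using Pool in Python?

Tags: python memory multiprocessing

【Answer 1】: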

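Right now you are keeping several lists in memory: vector_field_samples_x, vector_field_samples_y and vector_components, plus a separate copy of vector_components created during the map call (which is when you actually run out of memory). You can avoid needing either copy of the vector_components list by using pool.imap instead of pool.apply_async with a manually built list. imap returns an iterator rather than a complete list, so you never hold all the results in memory at once.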

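Normally, pool.map breaks the iterable you pass to it into chunks and sends those chunks to the child processes, rather than sending one element at a time. This helps performance. Because imap works with an iterator instead of a list, it does not know the full size of the iterable you pass to it. Without knowing that size, it cannot decide how big each chunk should be, so it defaults to a chunksize of 1, which works but may not perform as well. To avoid this, you can pass it a sensible chunksize argument yourself, since you know the iterable is samples elements long. It may not make much difference for your 500-element list, but it is worth experimenting with.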
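As a rough illustration of the chunk-size arithmetic described above, the helper below follows the same heuristic that pool.map applies internally in CPython: about four chunks per worker, rounded up. The function name suggested_chunksize and its chunks_per_worker parameter are hypothetical and exist only for this sketch; the code is written for Python 3.

```python
def suggested_chunksize(num_items, num_workers, chunks_per_worker=4):
    """Return a chunk size that gives roughly `chunks_per_worker` chunks per worker.

    Hypothetical helper, modelled on the heuristic pool.map uses internally.
    """
    if num_items <= 0:
        return 1  # imap expects a chunksize of at least 1
    chunksize, extra = divmod(num_items, num_workers * chunks_per_worker)
    if extra:
        chunksize += 1  # round up so the remainder is spread over the chunks
    return chunksize


if __name__ == "__main__":
    # 500 samples on 8 workers, the figures mentioned in the question.
    print(suggested_chunksize(500, 8))
```

With 500 samples and 8 workers this gives 16 items per chunk. Whether that is actually faster than the default of 1 depends on how long each sample takes, so it is worth timing both.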

Here is some sample code that demonstrates all of this:

import multiprocessing
from functools import partial


def vector_field_decomposer(x_dim, y_dim, x_steps, y_steps, vector_fields):
    # imap passes one item per call, so the x and y fields arrive as a tuple.
    vector_field_x_i = vector_fields[0]
    vector_field_y_i = vector_fields[1]
    # Do whatever is normally done here.


if __name__ == "__main__":
    num_workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_workers)
    # Calculate a good chunksize (based on the implementation of pool.map):
    # aim for roughly 4 chunks per worker, rounding up.
    chunksize, extra = divmod(samples, 4 * num_workers)
    if extra:
        chunksize += 1

    # Use partial so the fixed arguments can be passed to vector_field_decomposer.
    func = partial(vector_field_decomposer, x_dim, y_dim, x_steps, y_steps)
    # We use a generator expression as the iterable, so we don't create a full list.
    # (xrange is Python 2; use range on Python 3.)
    results = pool.imap(func,
                        ((vector_field_samples_x[s], vector_field_samples_y[s]) for s in xrange(samples)),
                        chunksize=chunksize)
    # imap yields results in input order, so enumerate recovers the sample index.
    for sample, vector_component in enumerate(results):
        CsvH.write_vector_field(vector_component,
                                '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')
    pool.close()
    pool.join()

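This should let you avoid the MemoryError, but if it does not, you could try running imap over smaller chunks of your total samples and simply make several passes. I don't think you will have any problems, though, because you are not building any lists other than the vector_field_* lists you start with.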
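If the fallback of several passes over smaller chunks turns out to be necessary, it might look roughly like the sketch below. The names batch_bounds, process_in_batches, handle_result and add_pair are all hypothetical: add_pair stands in for the real per-sample computation, and handle_result stands in for writing one result to its CSV file. The code is written for Python 3.

```python
import multiprocessing


def batch_bounds(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def process_in_batches(pool, func, xs, ys, batch_size, handle_result):
    """Run func over (x, y) pairs, one batch per imap call.

    Each result is handed to handle_result(index, result) as soon as it
    arrives, so no list of results is built up.
    """
    for start, stop in batch_bounds(len(xs), batch_size):
        pairs = ((xs[i], ys[i]) for i in range(start, stop))
        # imap preserves input order, so start + offset is the sample index.
        for offset, result in enumerate(pool.imap(func, pairs)):
            handle_result(start + offset, result)


def add_pair(pair):
    # Stand-in for the real per-sample computation (hypothetical).
    x, y = pair
    return x + y


if __name__ == "__main__":
    xs = list(range(10))
    ys = list(range(10, 20))
    collected = {}  # in real use, write each result to disk instead
    pool = multiprocessing.Pool(2)
    try:
        process_in_batches(pool, add_pair, xs, ys, 4, collected.__setitem__)
    finally:
        pool.close()
        pool.join()
    print(collected[0], collected[9])
```

The same pool is reused for every batch, so the cost of starting worker processes is paid only once. A smaller batch_size limits how many inputs are in flight at a time, at the price of a short idle period at the end of each batch.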

【Comments】: