MemoryError while reading and writing a 40GB CSV... where is my leak? (Answers)

【Question title】: MemoryError while reading and writing a 40GB CSV... where is my leak?
【Posted】: 2016-12-03 09:57:34
【Question description】:

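I have a 40GB CSV file from which I have to output different subsets of columns as CSVs again, while checking that the data contains no NaNs. I chose to use Pandas, and a minimal example of my implementation looks like this (inside a function output_different_formats):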

# column_names is a huge list containing the column union of all the output
#  column subsets
scen_iter = pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False,
                        usecols=column_names)
CHUNKSIZE = 630100
scen_cnt = 0
output_names = ['formatA', 'formatB', 'formatC', 'formatD', 'formatE']
# column_mappings is a dictionary mapping the output names to their
#  respective column subsets. 
while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        # some error handling (has yet to ever occur)
        pass
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')

    scen_cnt += 100

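I thought this was safe memory-wise, because I iterate over the file in chunks with .get_chunk() and never load the whole CSV into a DataFrame at once; I just append the next chunk to the end of each respective file.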
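The chunked pattern described above can be reproduced on a small scale. This is only a sketch: the CSV text and the helper name chunk_sizes are made up for illustration, and it passes chunksize to pd.read_csv (an alternative to iterator=True plus get_chunk) so that a plain for loop stops cleanly at the end of the file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the large CSV (hypothetical data).
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n13,14,15\n"


def chunk_sizes(text, chunksize):
    """Return the row count of every chunk read from the CSV text."""
    sizes = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(io.StringIO(text), chunksize=chunksize):
        sizes.append(len(chunk))
    return sizes


sizes = chunk_sizes(csv_text, 2)
print(sizes)
```

Only one chunk is referenced by the loop at a time, so peak memory should depend on the chunk size rather than on the file size, plus whatever temporary copies the library makes while processing that chunk.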

However, after about 3.5 GB of output had been generated, my program crashed with the following MemoryError on the .to_csv line, with a long traceback ending in:

  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError

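Why am I getting a MemoryError here? Is there a memory leak in my program, or am I misunderstanding something? Or could the program simply fail at random while writing the CSV for that particular chunk, in which case I should perhaps consider reducing the chunk size?

Full traceback: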

Traceback (most recent call last):
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 128, in <module>
    output_different_formats(real_world=False)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 50, in clocked
    result = func(*args, **kwargs)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 116, in output_different_formats
    mode='a', header=True, index=False, compression='gzip')
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 1188, in to_csv
    decimal=decimal)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\format.py", line 1293, in __init__
    self.obj = self.obj.loc[:, cols]
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1187, in __getitem__
    return self._getitem_tuple(key)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 720, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1323, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 966, in _getitem_iterable
    result = self.obj.reindex_axis(keyarr, axis=axis, level=level)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 2519, in reindex_axis
    fill_value=fill_value)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1852, in reindex_axis
    {axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1876, in _reindex_with_indexers
    copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3157, in reindex_indexer
    indexer, fill_tuple=(fill_value,))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3238, in _slice_take_blocks_ax0
    new_mgr_locs=mgr_locs, fill_tuple=None))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 853, in take_nd
    allow_fill=False)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError

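【Question comments】:

Maybe you could try calling the garbage collector (gc.collect()) inside the loop. As a workaround, you could also try a 64-bit version of Python.
@Jean-FrançoisFabre Trying gc.collect() now; I won't know whether it worked for a few more hours. Why would 64-bit Python help?
64-bit Python allows more memory to be allocated (of course, you need the physical memory/swap on your system and 64-bit Windows). This would not fix a memory leak, but it would delay it, hopefully until your program finishes.
@Jean-FrançoisFabre I see. I'll let you know if gc.collect() solves it. Thanks for your help!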
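
Regarding the 64-bit suggestion in the comments above, a quick way to see which kind of interpreter is running is sketched below. It uses only the standard library; the function name pointer_bits is just an illustrative choice.

```python
import struct
import sys


def pointer_bits():
    """Size of a C pointer in this interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


bits = pointer_bits()
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64bit = sys.maxsize > 2**32
print(f"{bits}-bit Python, 64-bit: {is_64bit}")
```

A 32-bit process on Windows has only about 2 GB of user address space by default, so a large contiguous allocation can fail there even when the machine has plenty of free RAM.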

Tags: python python-3.x pandas memory

【Solution 1】:

The current workaround is to call the garbage collector manually with gc.collect():

import gc

while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        # some error handling (has yet to ever occur)
        pass
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
        gc.collect()  # free the temporary column-subset copy right away
    gc.collect()
    scen_cnt += 100

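After adding these lines the memory consumption stays stable, but it is still unclear to me why this approach has memory problems in the first place.

【Comments】: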
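
For comparison, here is a different approach from the one in this answer: a sketch that avoids DataFrames entirely and streams the file row by row with the standard csv and gzip modules, so only one row is held in memory at a time. It assumes every selected column is numeric; the function name split_csv and the demo file names are hypothetical, and it has not been run against the 40GB file from the question.

```python
import csv
import gzip
import math
import os
import tempfile


def split_csv(src_path, column_mappings, float_format="%.8f"):
    """Stream src_path row by row into one gzip CSV per output path.

    column_mappings maps an output path to its list of column names.
    Returns the number of data rows written.
    """
    # Union of all columns that any output needs.
    needed = sorted({c for cols in column_mappings.values() for c in cols})
    handles, writers = [], {}
    rows = 0
    try:
        for name, cols in column_mappings.items():
            handle = gzip.open(name, "wt", newline="")
            handles.append(handle)
            writers[name] = csv.writer(handle)
            writers[name].writerow(cols)  # header is written once per file
        with open(src_path, newline="") as src:
            for row in csv.DictReader(src):
                formatted = {}
                for col in needed:
                    value = float(row[col])  # ValueError on empty text
                    if math.isnan(value):
                        raise ValueError(f"NaN in column {col!r}")
                    formatted[col] = float_format % value
                for name, cols in column_mappings.items():
                    writers[name].writerow([formatted[c] for c in cols])
                rows += 1
    finally:
        for handle in handles:
            handle.close()
    return rows


# Small demo on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    with open(src, "w", newline="") as f:
        f.write("a,b,c\n1,2,3\n4,5,6\n")
    out = os.path.join(tmp, "formatA.csv.gz")
    rows_written = split_csv(src, {out: ["c", "a"]})
    with gzip.open(out, "rt", newline="") as f:
        got = list(csv.reader(f))
```

This trades pandas' vectorized speed for flat memory use, so it is likely slower per row; whether that trade is worthwhile depends on how tight memory is.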

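This does not look like a memory leak, because if it were, calling the garbage collector would not have helped you. Most likely the memory is being allocated implicitly inside the various libraries you use. I would be surprised if there were anything you could do about it as a library user.
I would love to learn what a library user should do in your situation. :) But I cannot imagine that you caused a memory leak here, nor that your libraries did.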
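
To illustrate the implicit allocation mentioned in these comments: the traceback in the question shows to_csv selecting the requested columns via self.obj.loc[:, cols], which ends in np.empty(out_shape, dtype=dtype), that is, a freshly allocated array for the column subset. A rough back-of-the-envelope estimate of one such copy is sketched below. It assumes float64 columns, and the column count of 200 is purely hypothetical, since the question does not state the real number.

```python
def subset_copy_bytes(rows, cols, itemsize=8):
    """Bytes needed for a dense rows x cols array (float64 by default)."""
    return rows * cols * itemsize


CHUNKSIZE = 630100        # chunk size from the question
HYPOTHETICAL_COLS = 200   # assumption: the real column count is not given

needed = subset_copy_bytes(CHUNKSIZE, HYPOTHETICAL_COLS)
print(f"{needed / 2**30:.2f} GiB per subset copy")
```

If several such temporaries stay alive until the next garbage collection pass, or the process is 32-bit (not confirmed in the thread), an allocation of this size could plausibly fail, which would be consistent with a smaller chunk size or an explicit gc.collect() helping.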