【Posted】:2016-12-03 09:57:34
【Question】:
I have a 40 GB CSV file from which I need to write out several different column subsets as CSVs again, while also checking that the data contains no NaNs. I chose to use Pandas, and a minimal example of my implementation looks like this (inside the function output_different_formats):
import pandas as pd

# column_names is a huge list containing the column union of all the output
# column subsets
scen_iter = pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False,
                        usecols=column_names)
CHUNKSIZE = 630100
scen_cnt = 0
output_names = ['formatA', 'formatB', 'formatC', 'formatD', 'formatE']
# column_mappings is a dictionary mapping the output names to their
# respective column subsets.
while scen_cnt < 10000:
    # read the next CHUNKSIZE rows as a DataFrame
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    # append this chunk to each output file, keeping only its column subset
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
    scen_cnt += 100
I thought this was memory-safe, because I iterate over the file in chunks with .get_chunk() and never load the whole CSV into a DataFrame at once; each chunk is simply appended to the end of its respective output file.
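For reference, the same chunked read can also be written with pandas' chunksize= argument, which I believe is equivalent (a minimal sketch; column_names is the same list as above, and process_chunk is a hypothetical stand-in for the body of my loop):

import pandas as pd

CHUNKSIZE = 630100
# chunksize= turns read_csv into an iterator that yields DataFrames of at most
# CHUNKSIZE rows each, so only one chunk should be in memory at a time.
for scenario in pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                            na_filter=False, usecols=column_names,
                            chunksize=CHUNKSIZE):
    process_chunk(scenario)  # hypothetical stand-in for the per-chunk work above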
However, after roughly 3.5 GB of output had been generated, my program crashed on the .to_csv line with a MemoryError and a long traceback ending as follows:
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Why am I getting a MemoryError here? Is there a memory leak in my program, or am I misunderstanding something? Or is the program just failing sporadically when writing the CSV for that particular chunk, and should I perhaps consider reducing the chunk size?
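One sanity check I could add (a sketch, assuming the loop above; DataFrame.memory_usage is a standard pandas method) is to log how much memory each chunk actually occupies, to judge whether CHUNKSIZE is simply too large for the machine:

# Sketch: report the approximate in-memory footprint of the current chunk,
# to help decide whether CHUNKSIZE should be lowered.
chunk_bytes = scenario.memory_usage(deep=True).sum()
print('chunk at scen_cnt=%d occupies ~%.1f MB' % (scen_cnt, chunk_bytes / 1e6))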
Full traceback:
Traceback (most recent call last):
File "D:/AppData/A/MRM/Eric/output_formats.py", line 128, in <module>
output_different_formats(real_world=False)
File "D:/AppData/A/MRM/Eric/output_formats.py", line 50, in clocked
result = func(*args, **kwargs)
File "D:/AppData/A/MRM/Eric/output_formats.py", line 116, in output_different_formats
mode='a', header=True, index=False, compression='gzip')
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 1188, in to_csv
decimal=decimal)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\format.py", line 1293, in __init__
self.obj = self.obj.loc[:, cols]
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1187, in __getitem__
return self._getitem_tuple(key)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 720, in _getitem_tuple
retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1323, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 966, in _getitem_iterable
result = self.obj.reindex_axis(keyarr, axis=axis, level=level)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 2519, in reindex_axis
fill_value=fill_value)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1852, in reindex_axis
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1876, in _reindex_with_indexers
copy=copy)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3157, in reindex_indexer
indexer, fill_tuple=(fill_value,))
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3238, in _slice_take_blocks_ax0
new_mgr_locs=mgr_locs, fill_tuple=None))
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 853, in take_nd
allow_fill=False)
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
【Comments】:
- Maybe you could try calling the garbage collector (gc.collect()) inside the loop. As a workaround, you could also try a 64-bit version of Python.
- @Jean-FrançoisFabre Trying gc.collect() now; I won't know for another few hours whether it worked. Why would 64-bit Python help?
- 64-bit Python allows larger memory allocations (provided you have the physical RAM/swap on the system and 64-bit Windows, of course). It won't fix a memory leak, but it will delay it, hopefully until your program finishes.
- @Jean-FrançoisFabre I see. I'll let you know if gc.collect() solves it, thanks for the help!
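A minimal sketch of the gc.collect() workaround suggested in the comments, applied to the loop from the question (scen_iter, output_names and column_mappings are as defined above; whether this actually prevents the MemoryError is untested):

import gc

while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
    scen_cnt += 100
    del scenario  # drop the reference to the finished chunk
    gc.collect()  # ask the garbage collector to reclaim it before the next read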
Tags: python python-3.x pandas memory