【Title】: How to save a large pandas DataFrame with complex arrays and load it up again?
【Posted】: 2025-12-14 23:55:01
【Question】:

I have a large pandas DataFrame whose individual elements are complex-valued numpy arrays. See the minimal code example below to reproduce the scenario:


import numpy as np
import pandas as pd

d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)

for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2, 2)) + np.random.random(size=(2, 2)) * 1j

df

What is the best way to save this DataFrame and load it up again for later use?

The problem I run into is that when I increase the size of the stored matrices and the number of elements, I get an OverflowError when trying to save the DataFrame as an .h5 file, as shown below:

import numpy as np
import pandas as pd

size = (300,300)
xs = 1500

d = {f'x{i}': [] for i in range(xs)}
df = pd.DataFrame(data=d).astype(object)

for K in range(10): 
    for i in range(xs): 

        df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j

df.to_hdf('test.h5', key="df", mode="w")

load_test = pd.read_hdf("test.h5", "df")
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-124-8cb8df1a0653> in <module>
     12         df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
     13 
---> 14 df.to_hdf('test.h5', key="df", mode="w")
     15 
     16 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
   2447             data_columns=data_columns,
   2448             errors=errors,
-> 2449             encoding=encoding,
   2450         )
   2451 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
    268             path_or_buf, mode=mode, complevel=complevel, complib=complib
    269         ) as store:
--> 270             f(store)
    271     else:
    272         f(path_or_buf)

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
    260             data_columns=data_columns,
    261             errors=errors,
--> 262             encoding=encoding,
    263         )
    264 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times)
   1127             encoding=encoding,
   1128             errors=errors,
-> 1129             track_times=track_times,
   1130         )
   1131 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
   1799             nan_rep=nan_rep,
   1800             data_columns=data_columns,
-> 1801             track_times=track_times,
   1802         )
   1803 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, **kwargs)
   3189             # I have no idea why, but writing values before items fixed #2299
   3190             blk_items = data.items.take(blk.mgr_locs)
-> 3191             self.write_array(f"block{i}_values", blk.values, items=blk_items)
   3192             self.write_index(f"block{i}_items", blk_items)
   3193 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write_array(self, key, value, items)
   3047 
   3048             vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
-> 3049             vlarr.append(value)
   3050 
   3051         elif empty_array:

~/PQKs/pqks/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
    526             nparr = None
    527 
--> 528         self._append(nparr, nobjects)
    529         self.nrows += 1
    530 

~/PQKs/pqks/lib/python3.6/site-packages/tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()

OverflowError: value too large to convert to int

【Comments】:

  • Possible duplicate / answered here: *.com/a/57133759/8896855
  • You can save it as a binary file with pickle. Example: docs.python.org/3/library/pickle.html#examples

Tags: python pandas dataframe save numpy-ndarray


【Solution 1】:

As noted in the similar question https://*.com/a/57133759/8896855, hdf/h5 files carry more overhead and are designed for saving many dataframes into a single file-system-like store. Feather and Parquet objects will likely be a better fit for saving/loading a single larger dataframe as an in-memory object. As for the specific OverflowError, it is most likely the result of storing large mixed-type columns (of numpy arrays) under pandas' "object" dtype. A (more involved) option would be to split the arrays in the dataframe out into separate columns, but that is probably unnecessary.
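For what it's worth, the "split the arrays into separate columns" option could look roughly like the sketch below. This is only an illustration, not an established recipe: the helper names `explode` and `rebuild` are invented here, and the approach assumes every cell holds an array of the same shape. The point is that after flattening, each cell is a plain complex scalar rather than an ndarray, which avoids the object-dtype code path that overflows.

```python
import numpy as np
import pandas as pd

# Hypothetical helpers: flatten each matrix-valued cell into scalar
# columns, then reverse the transformation after loading.

def explode(df, shape):
    """Turn each (m, n)-array cell of `df` into m*n scalar columns."""
    n = shape[0] * shape[1]
    flat = {f'{c}_{j}': [df.loc[k, c].ravel()[j] for k in df.index]
            for c in df.columns for j in range(n)}
    return pd.DataFrame(flat, index=df.index)

def rebuild(flat_df, columns, shape):
    """Reassemble the scalar columns back into matrix-valued cells."""
    n = shape[0] * shape[1]
    data = {c: [flat_df.loc[k, [f'{c}_{j}' for j in range(n)]]
                .to_numpy().reshape(shape) for k in flat_df.index]
            for c in columns}
    return pd.DataFrame(data, index=flat_df.index)

# Round-trip check on a small frame of 2x2 complex matrices
rng = np.random.default_rng(0)
shape = (2, 2)
df = pd.DataFrame({c: [rng.random(shape) + 1j * rng.random(shape)
                       for _ in range(3)] for c in ('x0', 'x1')})
flat = explode(df, shape)
back = rebuild(flat, list(df.columns), shape)
assert all(np.allclose(df.loc[k, c], back.loc[k, c])
           for k in df.index for c in df.columns)
```

The flattened frame has a uniform complex128 dtype, at the cost of very wide frames (m*n columns per original column) and an extra reshaping step on load.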

The quick general fix is to use df.to_pickle(r'path_to/filename.pkl'), but to_feather or to_parquet may provide more optimized solutions.
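A minimal round-trip sketch of the pickle approach, using small array shapes for brevity (the idea is the same at (300, 300); `test.pkl` is an arbitrary filename):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small stand-in for the frame in the question: object-dtype columns
# whose cells are complex-valued matrices.
data = {f'x{i}': [rng.random((4, 4)) + 1j * rng.random((4, 4))
                  for _ in range(3)] for i in range(5)}
df = pd.DataFrame(data, index=[str(k) for k in range(3)])

df.to_pickle('test.pkl')             # pickle serializes arbitrary Python
loaded = pd.read_pickle('test.pkl')  # objects, so object cells survive

# Every cell comes back bit-identical
assert all(np.allclose(df.loc[k, c], loaded.loc[k, c])
           for k in df.index for c in df.columns)
```

Unlike the HDF5 fixed format, pickle never has to coerce the object column into a single HDF5 atom, so the size of the individual matrices is not an issue (the usual pickle caveats apply: it is Python-specific and should only be loaded from trusted sources).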

【Discussion】: