How to save a list in a pandas dataframe cell to a HDF5 table format?

【Question title】: How to save a list in a pandas dataframe cell to a HDF5 table format?
【Posted】: 2022-11-19 02:38:41
【Question】:

I have a dataframe that I want to save to an HDF5 file in an appendable format. The dataframe looks like this:

    column1
0   [0, 1, 2, 3, 4]

The code to reproduce the problem is:

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")

Unfortunately, it returns this error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-65-c2dbeaca15df> in <module>
      1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")

7 frames

/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
   4979                 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
   4980                 raise TypeError(
-> 4981                     f"Cannot serialize the column [{error_column_label}]\n"
   4982                     f"because its data contents are not [string] but "
   4983                     f"[{inferred_type}] object dtype"

TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype

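I know I can save each value in a separate column. That does not help my extended use case, because the lists may have variable lengths.

I know I could convert the list to a string and then recreate it from the string, but if I start converting every column to a string, I might as well use a text format such as csv instead of a binary format such as hdf5.

Is there a standard way to save a list in the hdf5 table format?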

【Comments】:

Tags: python pandas dataframe hdf5 pytables

【Solution 1】:

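Python lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your pandas 'column1' dataframe correctly, it may contain lists of different lengths. There is no (simple) way to represent that as an HDF5 dataset. [That is why you get the [mixed] object dtype message. Converting the dataframe creates an intermediate object that is written as the dataset. The dtype of the converted list data is "O" (object), and HDF5 does not support that type.]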
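
To make the dtype point concrete, here is a minimal sketch that inspects the column from the question. It only uses the column's `dtype` attribute and `pandas.api.types.infer_dtype`; the exact label it prints may depend on the pandas version, so no output is asserted here beyond what the traceback above already reports.

```python
import pandas as pd
from pandas.api.types import infer_dtype

test = pd.DataFrame({"column1": [list(range(0, 5))]})

# The cells hold Python list objects, so pandas stores the column as dtype "object".
print(test["column1"].dtype)

# infer_dtype looks at the actual values; for list objects it does not report "string",
# which is the only kind of object column the table format can serialize.
print(infer_dtype(test["column1"]))
```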

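However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all list elements in the dataframe have the same type (int in this case), and 2) all lists in the dataframe have the same length. (Lists of different lengths can be handled, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the package underlying pandas' HDF5 support, and h5py is widely used. The choice is yours.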
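
The two assumptions can be checked before any HDF5 code runs. The sketch below uses a hypothetical helper, `lists_to_2d_array` (not part of pandas or NumPy), that rejects lists of different lengths and lets NumPy enforce the element type while stacking the lists into a 2-D array.

```python
import numpy as np
import pandas as pd


def lists_to_2d_array(column, dtype="int32"):
    """Stack a column of equal-length lists into a 2-D NumPy array.

    Hypothetical helper for illustration; raises ValueError when the
    lists differ in length (assumption 2 is violated).
    """
    rows = column.tolist()
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"lists have different lengths: {sorted(lengths)}")
    # NumPy raises if an element cannot be converted to the requested dtype.
    return np.array(rows, dtype=dtype)


test = pd.DataFrame({"column1": [list(range(0, 5)), list(range(10, 15))]})
arr = lists_to_2d_array(test["column1"])
print(arr.shape, arr.dtype)
```

If both checks pass, the resulting array has one row per dataframe row and can be written by either package as an ordinary fixed-shape dataset.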

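Before I post the code, here is an outline of the process:

1. Create a NumPy record array (aka recarray) from the dataframe.
2. Define the desired type and shape for the HDF5 dataset (as an Atom for PyTables, or as a dtype for h5py).
3. Create the dataset using the Atom/dtype definition above (this could be done on 1 line, but it is easier to read this way).
4. Loop over the rows of the recarray (from step 1) and write the data to the matching rows of the dataset. This converts each list to an equivalent array.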

Code to create the recarray (adds 2 rows to the dataframe):

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (index=False leaves the dataframe index out of the records)
rec_arr = test.to_records(index=False)
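
As a quick check of what the writers below receive, this sketch rebuilds the same recarray and inspects it. It shows that the list column is still an object field at this point, so the conversion to a fixed-size array only happens when each row is assigned into the HDF5 dataset.

```python
import pandas as pd

test = pd.DataFrame(
    {"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]}
)
rec_arr = test.to_records(index=False)

print(rec_arr.shape)             # one record per dataframe row
print(rec_arr.dtype.names)       # field names come from the dataframe columns
print(rec_arr.dtype["column1"])  # the list column is stored as an object field
print(rec_arr[0]["column1"])     # each cell is still the original Python list
```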

PyTables-specific code to export the data:

import tables as tb
with tb.open_file('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),) )
    # create the dataset (named ds so the dataframe variable test is not overwritten)
    ds = h5f.create_array('/','test', shape=rec_arr.shape, atom=df_atom )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        ds[i] = rec_arr[i]['column1']
    print(ds[:])

h5py-specific code to export the data:

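import h5py
with h5py.File('74489101_h5py.h5', 'w') as h5f:
    # define dtype with type and shape of column1 data
    df_dt = (int,(len(rec_arr[0]['column1']),))
    # create the dataset (named ds so the dataframe variable test is not overwritten)
    ds = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        ds[i] = rec_arr[i]['column1']
    print(ds[:])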
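
The answer mentions that lists of different lengths can be handled but does not show how. One possible approach, which is an assumption here and not the answer's own method, is h5py's variable-length dtype (`h5py.vlen_dtype`, available in h5py 2.10 and later). The data and the file name below are illustrative only, and all elements are still assumed to be integers.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative ragged data: lists of different lengths, all integers.
ragged = [[0, 1, 2], [10, 11], [100, 101, 102, 103]]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "vlen_example.h5")

with h5py.File(path, "w") as h5f:
    # Each dataset element is a variable-length sequence of int32 values.
    vlen_int = h5py.vlen_dtype(np.dtype("int32"))
    ds = h5f.create_dataset("test", shape=(len(ragged),), dtype=vlen_int)
    for i, row in enumerate(ragged):
        ds[i] = np.asarray(row, dtype="int32")

with h5py.File(path, "r") as h5f:
    # Reading returns an object array holding one int32 array per element.
    restored = [row.tolist() for row in h5f["test"][:]]

print(restored)
```

This keeps the data in binary form without padding, at the cost of a dataset that pandas' own `read_hdf` is not designed to read back as a table.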

【Comments】: