How to create an h5py dataset of a dataset

【Question title】: How to create h5py dataset of a dataset
【Posted】: 2018-01-05 02:58:04
【Question】:

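I'm new to HDF5 and I'm trying to create a dataset with a compound type that has three columns: MD5, size, and another dataset.

How can I achieve this?

I tried the following code: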

import h5py
import numpy as np

dbfile = h5py.File("test.h5",'w')
dtype1 = h5py.Dataset('myset', (100,))
dtype2 = np.dtype([
    ('MD5', np.str_, 32),
    ('size', "i8"),
    ('timestep0', dtype1)
    ])
records = dbfile.create_dateset('records', (4,), rec_type)

I get the error:

TypeError: __init__() takes exactly 2 arguments (3 given)

It refers to this line:

dtype1 = h5py.Dataset('myset', (100,))

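【Comments】:

What is the h5py.Dataset() command supposed to do? Where is that usage documented? docs.h5py.org/en/latest/high/dataset.html#creating-datasets
I'm trying to define a field whose type is a dataset. I thought h5py.Dataset would do that. Part of the problem is that I can't find how to do this in the documentation.
I don't understand. Illustrate what you are trying to do with numpy arrays or with HDF5 references.
This is a duplicate of stackoverflow.com/questions/57667412/…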

Tags: python numpy hdf5 h5py

【Solution 1】:

h5py.Dataset('myset', (100,)) tries to create a Dataset object directly, that is, it calls the class's __init__. But according to the reference:

http://docs.h5py.org/en/latest/high/dataset.html#reference

class Dataset(identifier)

Dataset objects are typically created via Group.create_dataset(), or by
retrieving existing datasets from a file. Call this constructor to
create a new Dataset bound to an existing DatasetID identifier.
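As the quoted reference says, a Dataset normally comes from create_dataset() or from looking up an existing name, not from calling the class. A minimal sketch of both routes follows; the file name, the dataset name "from_data", and the in-memory core driver (used so nothing is written to disk) are assumptions for illustration only.

```python
import h5py
import numpy as np

# In-memory file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("sketch_create.h5", "w", driver="core", backing_store=False) as f:
    # Shape and dtype only: the dataset is allocated and reads back as zeros.
    empty = f.create_dataset("myset", shape=(100,), dtype="f8")

    # From existing data: shape and dtype are taken from the array.
    filled = f.create_dataset("from_data", data=np.arange(10, dtype="i8"))

    # Looking up an existing name also yields a Dataset object.
    same = f["myset"]

    created_ok = isinstance(empty, h5py.Dataset) and isinstance(same, h5py.Dataset)
    empty_shape = empty.shape
    filled_values = filled[:].tolist()
```

Note that create_dataset() returns a dataset that lives in the file; it is not a type, so it cannot be used as a field inside np.dtype.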

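Even if you could get such an object (and I still don't see how), it would not work inside np.dtype. For example, if I substitute the datetime.datetime class, the resulting field is dtype='O':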

In [503]: dtype2 = np.dtype([
     ...:     ('MD5', np.str_, 32),
     ...:     ('size', "i8"),
     ...:     ('timestep0', datetime.datetime)
     ...:     ])

In [504]: dtype2
Out[504]: dtype([('MD5', '<U32'), ('size', '<i8'), ('timestep0', 'O')])

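numpy dtypes are defined for types such as strings, integers, and floats, plus object. They are not defined for lists, dicts, or other Python classes.

I can save a compound dtype with h5py, but not an object dtype. There is an h5py dtype that loads as a numpy object dtype, but in general this does not work in the other direction.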

http://docs.h5py.org/en/latest/special.html#variable-length-strings

hdf5 can't write numpy array of object type
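The three cases described above can be checked side by side. This is a sketch under assumptions: the file and dataset names are invented, the digest is just a sample 32-character hex string, and the core driver keeps everything in memory.

```python
import h5py
import numpy as np

# Compound dtype built only from fixed-size fields.
compound = np.dtype([("MD5", "S32"), ("size", "i8")])
records = np.zeros(2, dtype=compound)
records[0] = (b"d41d8cd98f00b204e9800998ecf8427e", 0)

# Variable-length strings: an h5py special dtype that numpy sees as 'O' plus metadata.
vlen_str = h5py.special_dtype(vlen=str)

with h5py.File("sketch_dtypes.h5", "w", driver="core", backing_store=False) as f:
    # 1. Compound dtype: stored directly.
    ds = f.create_dataset("records", data=records)
    stored_md5 = ds[0]["MD5"]

    # 2. Special dtype: stored as HDF5 variable-length strings.
    names = f.create_dataset("names", (2,), dtype=vlen_str)
    names[0] = "timestep0"

    # 3. Plain object array without h5py metadata: no HDF5 equivalent exists.
    plain = np.empty(2, dtype=object)
    plain[0], plain[1] = {"a": 1}, [1, 2]
    try:
        f.create_dataset("objects", data=plain)
        object_error = None
    except TypeError as exc:
        object_error = exc
```

Only the third case fails: h5py has no rule for mapping arbitrary Python objects to an HDF5 type, so it raises TypeError.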

http://docs.h5py.org/en/latest/refs.html - object references

In [6]: import numpy as np
In [7]: import h5py
In [8]: f = h5py.File('wtihref.h5','w')
# Pass the array as data=; a bare second argument is interpreted as the shape.
# The integer type shown below ('<i4') depends on the platform's default int.
In [9]: ds0 = f.create_dataset('dset0', data=np.arange(10))
In [10]: ds1 = f.create_dataset('dset1', data=np.arange(11))
In [11]: ds2 = f.create_dataset('dset2', data=np.arange(12))
In [12]: ds2.ref
Out[12]: <HDF5 object reference>
In [13]: ref_dtype = h5py.special_dtype(ref=h5py.Reference)
In [14]: ref_dtype
Out[14]: dtype('O')
In [16]: rds = f.create_dataset('refdset', (5,), dtype=ref_dtype)
In [17]: rds[:3]=[ds0.ref, ds1.ref, ds2.ref]
In [28]: [f[r] for r in rds[:3]]
Out[28]: 
[<HDF5 dataset "dset0": shape (10,), type "<i4">,
 <HDF5 dataset "dset1": shape (11,), type "<i4">,
 <HDF5 dataset "dset2": shape (12,), type "<i4">]

Using a compound dtype:

In [55]: dt2 = np.dtype([('x',int),('y','S12'),('z',ref_dtype)])
In [56]: rds1 = f.create_dataset('refdtype', (5,), dtype=dt2)
In [72]: rds1[0]=(0,b'ONE',ds0.ref)
In [75]: rds1[1]=(1,b'two',ds1.ref)
In [76]: rds1[2]=(2,b'three',ds2.ref)
In [82]: rds1[:3]
Out[82]: 
array([(0, b'ONE', <HDF5 object reference>),
       (1, b'two', <HDF5 object reference>),
       (2, b'three', <HDF5 object reference>)],
      dtype=[('x', '<i4'), ('y', 'S12'), ('z', 'O')])
In [83]: f[rds1[0]['z']]
Out[83]: <HDF5 dataset "dset0": shape (10,), type "<i4">
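Applied to the original question, the same idea gives a records dataset whose third column is a reference to a separate dataset instead of an embedded one. The following is only a sketch: the group name "timesteps", the payload contents, the placeholder digest, and the in-memory file are all assumptions, not part of the question.

```python
import h5py
import numpy as np

ref_dtype = h5py.special_dtype(ref=h5py.Reference)
rec_type = np.dtype([
    ("MD5", "S32"),            # fixed-length byte string for a hex digest
    ("size", "i8"),
    ("timestep0", ref_dtype),  # object reference to another dataset
])

with h5py.File("sketch_records.h5", "w", driver="core", backing_store=False) as f:
    records = f.create_dataset("records", (4,), dtype=rec_type)
    for i in range(4):
        # One payload dataset per record (hypothetical names and contents).
        payload = f.create_dataset(f"timesteps/rec{i}",
                                   data=np.full(100, i, dtype="f8"))
        nbytes = payload.size * payload.dtype.itemsize
        records[i] = (b"0" * 32, nbytes, payload.ref)

    # Follow the reference stored in record 2 back to its dataset.
    target = f[records[2]["timestep0"]]
    target_name = target.name
    target_first = float(target[0])
    stored_size = int(records[2]["size"])
```

Storing a reference keeps each record fixed-size while the referenced datasets can have any shape, which is the closest HDF5 equivalent to a "dataset inside a dataset".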

h5py uses the dtype's metadata attribute to store the reference information:

In [84]: ref_dtype.metadata
Out[84]: mappingproxy({'ref': h5py.h5r.Reference})
In [85]: dt2.fields['z']
Out[85]: (dtype('O'), 16)
In [86]: dt2.fields['z'][0].metadata
Out[86]: mappingproxy({'ref': h5py.h5r.Reference})
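The same metadata can be queried through h5py.check_dtype instead of reading .metadata by hand. A small sketch, assuming the helper name is_ref_field (hypothetical, introduced here for illustration) and the same field layout as dt2 with an explicit 'i4' for the integer field:

```python
import h5py
import numpy as np

ref_dtype = h5py.special_dtype(ref=h5py.Reference)
dt2 = np.dtype([("x", "i4"), ("y", "S12"), ("z", ref_dtype)])

def is_ref_field(dtype, name):
    """Return True if the named field carries h5py's object-reference tag."""
    field_dtype = dtype.fields[name][0]
    return h5py.check_dtype(ref=field_dtype) is h5py.Reference

# Scan the compound dtype for fields that hold object references.
ref_fields = [name for name in dt2.names if is_ref_field(dt2, name)]

# A bare object dtype has no such tag, so h5py cannot map it to an HDF5 type.
plain_object_tag = h5py.check_dtype(ref=np.dtype("O"))
```

Here ref_fields contains only 'z', and plain_object_tag is None, which is why an untagged object array cannot be written.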
