How to get a data array from an HDF5 file in Python 3.6 if the dtype is "<u4"? (Answers)

【Question title】: How in Python 3.6 to get data array from hdf5 file if dtype is "<u4"?
【Posted】: 2020-08-13 23:00:45
【Question description】:

I want to get a dataset of shape {N, 16, 512, 128} from an hdf5 file as a 4D numpy array. N is the number of 3D arrays, each of shape {16, 512, 128}. I tried this:

import os
import sys
import h5py as h5
import numpy as np
import subprocess
import re

file_name = sys.argv[1]
path = sys.argv[2]

f = h5.File(file_name, 'r')
data = f[path]
print(data.shape) #{27270, 16, 512, 128}
print(data.dtype) #"<u4"

data = np.array(data, dtype=np.uint32)
print(data.shape)

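Unfortunately, the code seems to crash after the data = np.array(data, dtype=np.uint32) command, because nothing happens afterwards.

I need to retrieve this dataset as a numpy array, or as something similar, for further calculations. If you have any suggestions, please let me know.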

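【Comments】:

What does data[:] produce? h5py also recommends its own astype: with dset.astype(...): out = dset[:]
@hpaulj, can you explain your question? What do you mean by asking what data[:] produces? What happens when I print data[:]? After or before what? I did print(data) and the result was <HDF5 dataset: shape (27270, 16, 512, 128), type "<u4">
data is an h5py dataset object. data[:] is a numpy array (this is the preferred syntax, but not the only one).
@hpaulj, OK, I see. Again, I'm not sure what went wrong, but nothing happened; the command-line window didn't change.
You may not need an array. data = f[path] is an h5py dataset object that behaves like an array. If you really need an array, use data_arr = f[path][:], which returns an array. There is no good reason to use np.array(). The difference: an array must fit in memory; the dataset object does not have the same memory requirement.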

Tags: python arrays numpy hdf5 h5py

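【Solution 1】:

It turns out you don't even need to reshape. Here is an example that accesses the dataset and then slices it to get an array. I think this is what you want.

Edit, 30 April 2020
The OP asked about uint32. My original answer used an array of floats. It demonstrated the desired behavior. For completeness, I modified it slightly to create the dataset from an array of uint32 integers.
Note: I used a0=100. The HDF5 file it creates is 840 MB for floats and 416 MB for uint32. For a0=27270, multiply those sizes by about 273. I don't have enough memory to create that all at once. The code below shows the process.

(Note: the dataset is created with maxshape=None for axis=0 to allow it to be extended. If you want to test larger datasets, you can modify this example by adding a loop that creates more data and appends it to the end of the dataset.)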

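import numpy as np
import h5py

# a0 = 27270  # full size from the question; too large to create in memory at once
a0 = 100
a1 = 16
a2 = 512
a3 = 128

# Original float version of the test data, kept for reference (about 840 MB):
# f_arr = np.random.rand(a0*a1*a2*a3).reshape(a0, a1, a2, a3)
i_arr = np.random.randint(0, 254, (a0, a1, a2, a3), dtype=np.uint32)

# Write the uint32 array; axis 0 is left extendable via maxshape
with h5py.File('SO_61508870.h5', mode='w') as h5w:
    h5w.create_dataset('array1', data=i_arr, maxshape=(None, a1, a2, a3))

# Read it back: data_ds is a dataset object, each slice is a NumPy array
with h5py.File('SO_61508870.h5', mode='r') as h5r:
    data_ds = h5r['array1']
    print('dataset shape:', data_ds.shape)
    for i in range(5):
        sliced_arr = data_ds[i, :, :, :]
        print('array shape:', sliced_arr.shape)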
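
The note about extending the dataset along axis 0 can be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name append_blocks, its parameters, and the chunk layout (one 3D array per chunk) are assumptions chosen for the example.

```python
import h5py
import numpy as np


def append_blocks(file_name, n_blocks, block_len, tail_shape=(16, 512, 128)):
    """Create 'array1' empty, then grow it along axis 0 one block at a time."""
    with h5py.File(file_name, mode='w') as h5w:
        ds = h5w.create_dataset(
            'array1',
            shape=(0,) + tail_shape,          # start with no entries
            maxshape=(None,) + tail_shape,    # axis 0 may grow without limit
            dtype=np.uint32,
            chunks=(1,) + tail_shape,         # assumed layout: one 3D array per chunk
        )
        for _ in range(n_blocks):
            # Only one block of random test data is held in memory at a time
            block = np.random.randint(0, 254, (block_len,) + tail_shape,
                                      dtype=np.uint32)
            start = ds.shape[0]
            ds.resize(start + block_len, axis=0)
            ds[start:start + block_len] = block
        return ds.shape
```

Calling append_blocks('SO_61508870.h5', n_blocks=10, block_len=100) would, under these assumptions, build a (1000, 16, 512, 128) dataset while never holding more than 100 of the 3D arrays in memory.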

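【Comments】:

LOL, no, this is very basic h5py stuff. I never create an array unless I absolutely need one for a downstream operation (usually a NumPy function that requires an array as input).

【Solution 2】:

I have no problem writing and reading <u4 and np.uint32: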

In [14]: import h5py                                                                                   
In [15]: f=h5py.File('u4.h5','w')                                                                      
In [16]: ds = f.create_dataset('data', dtype='<u4', shape=(10,))                                       
In [17]: ds                                                                                            
Out[17]: <HDF5 dataset "data": shape (10,), type "<u4">
In [18]: ds[:]                                                                                         
Out[18]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)
In [19]: ds[:] = np.arange(-5,5)                                                                       
In [20]: ds                                                                                            
Out[20]: <HDF5 dataset "data": shape (10,), type "<u4">
In [21]: ds[:]                                                                                         
Out[21]: array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4], dtype=uint32)
In [22]: np.array(ds, dtype='uint32')                                                                  
Out[22]: array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4], dtype=uint32)
In [23]: f.close()

You may be hitting a memory limit. I get a memory error when I try to create an array of that size:

In [24]: np.zeros((27270, 16, 512, 128),np.uint32);                                                    
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-24-2cfe704044b6> in <module>
----> 1 np.zeros((27270, 16, 512, 128),np.uint32);

MemoryError: Unable to allocate 107. GiB for an array with shape (27270, 16, 512, 128) and data type uint32
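
The size in that error message can be checked without allocating anything. A minimal sketch, where array_nbytes is a hypothetical helper written for this example (it is not a NumPy or h5py function):

```python
import numpy as np


def array_nbytes(shape, dtype):
    """Bytes needed to hold an array of this shape and dtype in memory."""
    n_items = 1
    for dim in shape:
        n_items *= int(dim)  # plain Python ints, so the product cannot overflow
    return n_items * np.dtype(dtype).itemsize


nbytes = array_nbytes((27270, 16, 512, 128), '<u4')
print(nbytes / 2**30)  # size in GiB
```

For the shape in the question this gives 114,378,670,080 bytes, about 106.5 GiB, which agrees with the rounded figure in the traceback. The same check can be run on data.shape and data.dtype before deciding whether to load the whole dataset.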

You may still be able to load slices of data, e.g. data[0:100].
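
Loading slices can be turned into a loop that processes the dataset block by block. The sketch below assumes the goal is a sum over axis 0; the function name blockwise_sum, the block length of 100, and the choice of reduction are all illustrative assumptions, not something the question specifies.

```python
import h5py
import numpy as np


def blockwise_sum(file_name, path, block_len=100):
    """Sum a dataset over axis 0 while holding only one block in memory."""
    with h5py.File(file_name, 'r') as f:
        data = f[path]  # h5py dataset object; no data has been read yet
        n = data.shape[0]
        # uint64 accumulator so that sums of uint32 values do not wrap around
        total = np.zeros(data.shape[1:], dtype=np.uint64)
        for start in range(0, n, block_len):
            stop = min(start + block_len, n)
            block = data[start:stop]  # NumPy array with at most block_len entries
            total += block.sum(axis=0, dtype=np.uint64)
    return total
```

With block_len=100 and the shape from the question, each block is roughly 400 MiB of uint32 values, so the block length should be tuned to the memory actually available.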

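【Comments】:

Such a dataset certainly exists, but if you do what I described in the post and try to read it from an hdf5 file, I think you will run into the same situation. Trying it with a small dataset doesn't tell you anything, because that is not the real case.
You mean reading it in chunks? I'll try that; it seems the more reasonable approach in this case.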