h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file? (Answers)

【Question title】: h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file?
【Posted】: 2021-06-14 01:22:39
【Question】:

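I have to store sub-samples of large images as .npy arrays of size (20,20,5). To sample uniformly when training a classification model, I am looking for an efficient way to store nearly 10 million sub-samples that allows this.

If I store them as whole images, sampling during training would not be representative of the distribution. I have the storage space, but I would run out of inodes trying to store that many "small" files. h5py / writing to an hdf5 file is a natural answer to my problem, but the process has been very slow. Running a program for a day and a half was not enough to write all of the sub-samples. I am new to h5py, and I am wondering whether too many writes are the cause.

If so, I am not sure how to chunk properly to avoid the problem of non-uniform sampling. Each image has a different number of sub-samples (e.g., one image may be (20000,20,20,5) and another may be (32123,20,20,5)).

This is the code I am using to write each sample to the .hdf5: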

#define possible groups
groups=['training_samples','validation_samples','test_samples']

f = h5py.File('~/.../TrainingData_.hdf5', 'a', libver='latest')

At this point I run a sub-sampling function that returns a NumPy array, trarray, of size (x,20,20,5).

Then:

label = np.array([1])
for i in range(trarray.shape[0]):
   group_choice = random.choices(groups, weights = [65, 15, 20])
   subarr = trarray[i,:,:,:]

   if group_choice[0] == 'training_samples':
       training_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       training_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1
   elif group_choice[0] =='validation_samples':
       validation_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       validation_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1
   else:
       test_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       test_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1

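What can I do to improve this? Is there something fundamentally wrong with the way I am using h5py?

【Comments】:

For the optimal chunk size and chunk shape, it is very important to know the exact read and write pattern. You also have to set up the chunk cache properly (the default of 1 MB is usually too small). Example: stackoverflow.com/a/48405220/4045774. Depending on the storage system, the chunk size also has a large impact on write speed: stackoverflow.com/a/44961222/4045774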
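
A minimal sketch of what this comment suggests, not a tuned configuration. It assumes an h5py version that accepts the `rdcc_nbytes` keyword on `h5py.File`; the file name, the 64 MiB cache size and the 256-sample chunk are illustrative values only and should be matched to the real read and write pattern.

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed values for illustration only; tune them to the real access pattern.
CHUNK_ROWS = 256               # samples per chunk (hypothetical)
CACHE_BYTES = 64 * 1024**2     # 64 MiB chunk cache instead of the 1 MiB default

path = os.path.join(tempfile.mkdtemp(), "chunk_demo.hdf5")
data = np.random.randint(1, 255, size=(1_000, 20, 20, 5), dtype=np.int32)

# rdcc_nbytes sets the raw-data chunk cache size used for datasets in this file.
with h5py.File(path, "w", rdcc_nbytes=CACHE_BYTES) as h5f:
    dset = h5f.create_dataset(
        "samples",
        shape=data.shape,
        dtype=data.dtype,
        chunks=(CHUNK_ROWS, 20, 20, 5),  # each chunk holds CHUNK_ROWS whole samples
    )
    dset[...] = data
    chunk_shape = dset.chunks

# Read the data back to confirm that chunking did not change the contents.
with h5py.File(path, "r") as h5f:
    round_trip_ok = np.array_equal(h5f["samples"][...], data)
```

Chunking along the first axis only keeps every (20,20,5) sample inside a single chunk, so reading one sample never touches more than one chunk.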

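Tags: python numpy bigdata hdf5 h5py

【Solution 1】:

Update, 22 March 2021: see the note on attributes below.
This is an interesting use case. My answer to a previous question touches on this (it is referenced in my first answer to this question). Clearly, the overhead of writing a large number of small objects is greater than the actual write process. I was curious, so I created a prototype to explore different procedures for writing the data.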

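My starting scenario:

I created a NumPy array of random integers with shape (NN,20,20,5).
Then I followed your logic to slice 1 row at a time and assign it as a training, validation or test sample.
I wrote the slice as a new dataset in the appropriate group.
I added attributes to the group to reference the slice # for each dataset.

Key findings:

The time to write each array slice to a new dataset remains relatively constant throughout the process.
However, write times grow exponentially as the number of attributes (NN) increases. I did not understand this in my initial post. For small values of NN (…

Table of incremental write time per 1,000 slices (without and with attributes). (Multiply by NN/1000 for the total time.)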

Slice count	Time in sec (w/out attrs)	Time in sec (with attrs)
1_000	0.34	2.4
2_000	0.34	12.7
5_000	0.33	111.7
10_000	0.34	1783.3
20_000	0.35	n/a

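Clearly, using attributes is not an efficient way to save the slice indices. Instead, I captured the index as part of the dataset name. This is shown in the "original" code below. The code that adds attributes is included in case you need it.

I created a new process that does all of the slicing first, then writes all of the data in 3 steps (1 each for the training, validation and test samples). Since you can no longer get the slice index from the dataset name, I tested 2 different methods to save that data: 1) as a second "index" dataset for each "sample" dataset, and 2) as group attributes. Both methods are significantly faster than the original process. Writing the indices as an index dataset has almost no impact on performance. Writing them as attributes is much slower. Data:

Table of total write time for all slices (without and with attributes).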

Slice count	Time in secs (no indices)	Time in secs (index dataset)	Time in secs (with attrs)
10_000	0.43	0.57	141.05
20_000	1.17	1.27	n/a

This looks like a promising way to slice your data and write it to HDF5 in a reasonable amount of time. You will have to work out the indexing notation.
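
For the indexing side, here is a rough sketch of drawing a uniform random batch from one stacked dataset and its companion index dataset. The group and dataset names mirror the ones used in the test-scenario code; the file name, the helper `draw_batch` and the small demo arrays are hypothetical. It sorts the random row numbers before reading because h5py's fancy indexing expects indices in increasing order.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "sampling_demo.hdf5")
samples = np.random.randint(1, 255, size=(500, 20, 20, 5), dtype=np.int32)
# Column 0: row in the stacked dataset; column 1: slice number in the source image
# (made-up values here: twice the row number).
indices = np.column_stack([np.arange(500), np.arange(500) * 2]).astype(np.int32)

with h5py.File(path, "w") as h5f:
    grp = h5f.create_group("training")
    grp.create_dataset("training_samples", data=samples)
    grp.create_dataset("training_indices", data=indices)


def draw_batch(h5f, batch_size, rng):
    """Return a uniformly drawn batch and the matching rows of the index dataset."""
    dset = h5f["training/training_samples"]
    # Draw distinct rows, then sort them: h5py wants increasing indices.
    rows = np.sort(rng.choice(dset.shape[0], size=batch_size, replace=False))
    batch = dset[rows]
    origin = h5f["training/training_indices"][rows]
    # Shuffle in memory so the batch is not ordered by row number.
    order = rng.permutation(batch_size)
    return batch[order], origin[order]


rng = np.random.default_rng(0)
with h5py.File(path, "r") as h5f:
    batch, origin = draw_batch(h5f, 32, rng)
```

Every row of the dataset is equally likely to be drawn regardless of which image it came from, which is the uniform sampling the question asks for.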

Code for the starting scenario:

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training', 'validation', 'test']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))
label = np.array([1])

with h5py.File('TrainingData_orig.hdf5', 'w') as h5f:
    # At this point I run a sub-sampling function that returns a NumPy array,
    # trarray, of size (x,20,20,5).
    for group in groups:
        h5f.create_group(group + '_samples')
        h5f.create_group(group + '_labels')

    time0 = timeit.default_timer()
    for i in range(trarray.shape[0]):
        group_choice = random.choices(groups, weights=[65, 15, 20])

        # one dataset per slice; the slice index is kept in the dataset name
        h5f[group_choice[0] + '_samples'].create_dataset(f'ID-{i:04}', data=trarray[i, :, :, :])
        # optional: write a label dataset, or save the label as a group attribute (slow)
        # h5f[group_choice[0] + '_labels'].create_dataset(f'ID-{i:04}', data=label)
        # h5f[group_choice[0] + '_samples'].attrs[f'ID-{i:04}'] = label

        # report the elapsed time for every block of 1,000 datasets
        if (i + 1) % 1000 == 0:
            exe_time = timeit.default_timer() - time0
            print(f'incremental time to write {i+1} datasets = {exe_time:.2f} secs')
            time0 = timeit.default_timer()

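Code for the test scenario:
Note: the calls that write the attributes are commented out.

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']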

# one image may be (20000,20,20,5)
trarray = np.random.randint(1,255, (20_000,20,20,5) )
training   = np.empty(trarray.shape,dtype=np.int32)
validation = np.empty(trarray.shape,dtype=np.int32)
test       = np.empty(trarray.shape,dtype=np.int32)

indx1, indx2, indx3 = 0, 0, 0
training_list = []
validation_list = []
test_list = []

training_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)
validation_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)
test_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)

start = timeit.default_timer()

# At this point I run a sub-sampling function that returns a NumPy array,
# trarray, of size (x,20,20,5).
for i in range(trarray.shape[0]):
    group_choice = random.choices(groups, weights=[65, 15, 20])
    # copy the slice into the in-memory array for its group and record
    # (row in the group array, slice number in trarray)
    if group_choice[0] == 'training_samples':
        training[indx1, :, :, :] = trarray[i, :, :, :]
        training_list.append((f'ID-{indx1:04}', i))
        training_idx[indx1, :] = [indx1, i]
        indx1 += 1
    elif group_choice[0] == 'validation_samples':
        validation[indx2, :, :, :] = trarray[i, :, :, :]
        validation_list.append((f'ID-{indx2:04}', i))
        validation_idx[indx2, :] = [indx2, i]
        indx2 += 1
    else:
        test[indx3, :, :, :] = trarray[i, :, :, :]
        test_list.append((f'ID-{indx3:04}', i))
        test_idx[indx3, :] = [indx3, i]
        indx3 += 1


with h5py.File('TrainingData1_.hdf5', 'w') as h5f :
    
    h5f.create_group('training')
    h5f['training'].create_dataset('training_samples', data=training[0:indx1,:,:,:])
    h5f['training'].create_dataset('training_indices', data=training_idx[0:indx1,:])
    # for label, idx in training_list:
    #     h5f['training']['training_samples'].attrs[label] = idx

    h5f.create_group('validation')
    h5f['validation'].create_dataset('validation_samples', data=validation[0:indx2,:,:,:])
    h5f['validation'].create_dataset('validation_indices', data=validation_idx[0:indx2,:])
    # for label, idx in validation_list:
    #     h5f['validation']['validation_samples'].attrs[label] = idx

    h5f.create_group('test')
    h5f['test'].create_dataset('test_samples', data=test[0:indx3,:,:,:])
    h5f['test'].create_dataset('test_indices', data=test_idx[0:indx3,:])
    # for label, idx in test_list:
    #     h5f['test']['test_samples'].attrs[label] = idx

exe_time = timeit.default_timer() - start          
print(f'Write time for {trarray.shape[0]} images slices = {exe_time:.2f} secs')

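【Comments】:

【Solution 2】:

Chunked storage is designed to optimize I/O for very large datasets. Your datasets are (1,20,20,5), right? If so, that is very small (in the HDF5 world), so I don't think chunking will help.

If I understand correctly, you are creating a new dataset for each sub-sample, based on the size of trarray.shape[0] (which gives 20,000 to 32,123 sub-samples; that is your loop length). That is a lot of individual writes.

I ran some I/O tests a few years ago and found that h5py (and PyTables) write performance is dominated by the number of I/O operations, not by the size of the dataset being written. Take a look at this answer: pytables writes much faster than h5py. Why? It compares I/O performance when writing the same total amount of data using different sizes of I/O data blocks (for both h5py and PyTables). The first key finding applies here: the total time to write all of the data is a linear function of the number of loops (for both PyTables and h5py).

The way to improve the run time is to reduce the number of I/O loops. Some ideas:

Is there a way to collect the samples in training, validation and test NumPy arrays, and then write all of the samples to a single dataset in one go?
If not, can you create 3 empty, resizable datasets (for training, validation and test), then write the data from each loop into the appropriate dataset and index? This might save time because you are only writing and not allocating. (Testing is required to be sure.)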
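
The second idea could look roughly like the sketch below, shown for one split only. It is an untested-at-scale illustration: the file name, the helper `append_rows`, the chunk size and the two image sizes are all hypothetical. The dataset is created empty with `maxshape=(None, 20, 20, 5)` and is grown once per image rather than once per sample.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow dset along axis 0 and write rows into the newly added space."""
    start = dset.shape[0]
    dset.resize(start + rows.shape[0], axis=0)
    dset[start:] = rows


path = os.path.join(tempfile.mkdtemp(), "resizable_demo.hdf5")
image_sizes = [300, 450]  # hypothetical sub-sample counts for two images

with h5py.File(path, "w") as h5f:
    # Resizable datasets must be chunked; the chunk size here is an assumption.
    dset = h5f.create_dataset(
        "training_samples",
        shape=(0, 20, 20, 5),
        maxshape=(None, 20, 20, 5),
        dtype=np.int32,
        chunks=(128, 20, 20, 5),
    )
    for n in image_sizes:
        # Stand-in for the sub-sampling function that returns trarray.
        trarray = np.random.randint(1, 255, size=(n, 20, 20, 5), dtype=np.int32)
        append_rows(dset, trarray)
    final_shape = dset.shape
```

With this layout the number of write calls scales with the number of images, not with the number of sub-samples.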

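【Comments】:

Thank you. I think you are right that the write time is a linear function of the number of sub-samples. I cannot create NumPy arrays larger than what fits in memory, which is why I am writing to .hdf5. Unfortunately, by the third image the operation was taking about 30 minutes, so this is not feasible.
Here is another idea to reduce I/O operations. Create 3 datasets (training, validation, test) plus 1 additional dataset with HDF5 region references to the appropriate set. You can allocate the datasets to hold all of the data and avoid the reallocation overhead. Then loop through your data, stack a "large number" of samples (whatever you can hold in RAM) and write them to the dataset. With more details, I think I could create a prototype using dummy data (a random NumPy array).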
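
A rough sketch of the "allocate once, write in large blocks" part of this idea, for a single split. The region-reference dataset is left out. `BufferedWriter`, the file name, `TOTAL_ROWS` and `BUFFER_ROWS` are hypothetical names and values, and the sketch assumes the total number of samples is known (or can be over-estimated) before writing starts.

```python
import os
import tempfile

import h5py
import numpy as np

TOTAL_ROWS = 1_000   # hypothetical: total samples expected for this split
BUFFER_ROWS = 256    # hypothetical: how many samples fit comfortably in RAM


class BufferedWriter:
    """Collect samples in RAM and write them to a preallocated dataset in blocks."""

    def __init__(self, dset, buffer_rows):
        self.dset = dset
        self.buffer = np.empty((buffer_rows,) + dset.shape[1:], dtype=dset.dtype)
        self.filled = 0    # rows currently waiting in the buffer
        self.written = 0   # rows already written to the file

    def add(self, sample):
        self.buffer[self.filled] = sample
        self.filled += 1
        if self.filled == self.buffer.shape[0]:
            self.flush()

    def flush(self):
        if self.filled:
            stop = self.written + self.filled
            self.dset[self.written:stop] = self.buffer[:self.filled]
            self.written = stop
            self.filled = 0


path = os.path.join(tempfile.mkdtemp(), "buffered_demo.hdf5")
source = np.random.randint(1, 255, size=(TOTAL_ROWS, 20, 20, 5), dtype=np.int32)

with h5py.File(path, "w") as h5f:
    # Allocate the full dataset up front so it never has to be resized.
    dset = h5f.create_dataset("training_samples", shape=source.shape, dtype=source.dtype)
    writer = BufferedWriter(dset, BUFFER_ROWS)
    for sample in source:
        writer.add(sample)
    writer.flush()  # write whatever is left in the buffer
    rows_written = writer.written

with h5py.File(path, "r") as h5f:
    stored = h5f["training_samples"][...]
```

With these numbers the 1,000 samples reach the file in 4 write calls (three full buffers of 256 and a final partial one) instead of 1,000.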