【Question Title】: compressed files bigger in h5py
【Posted】: 2016-01-04 20:01:54
【Question】:

I am using h5py to save numpy arrays in HDF5 format from Python. Recently, I tried to apply compression and the files I get are bigger...

I went from something like this (each file has several datasets):

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, 
         dtype=float, data=estimated_pos)

to something like this:

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, dtype=float,
        data=estimated_pos, compression="gzip", compression_opts=9)

In one particular example, the compressed file is 172K and the uncompressed one is 72K (and h5diff reports that both files are equal). I tried a more basic example and it works as expected... but not in my program.

How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it is probably related to h5py and how I am using it :-/ Any ideas?

Cheers!!

EDIT:

Looking at the output of h5stat, it seems the compressed version stores a lot of metadata (see the last lines of each listing).
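For reference, statistics at this level of detail are what a plain h5stat invocation (without the -S summary flag) prints, e.g.:

~$ h5stat res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5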

Compressed file

Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3798/503
        Datasets(exclude compact data): 15904/9254
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 116824
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 33602
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 2
    Dataset layout counts[CHUNKED]: 54
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 2
        GZIP filter: 54
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 136526 bytes
  Raw data: 33602 bytes
  Unaccounted space: 5111 bytes
Total space: 175239 bytes

Uncompressed file

Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3663/452
        Datasets(exclude compact data): 15904/10200
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 0
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 50600
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 56
    Dataset layout counts[CHUNKED]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 56
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 19567 bytes
  Raw data: 50600 bytes
  Unaccounted space: 5057 bytes
Total space: 75224 bytes

【Question Discussion】:

  • Did you copy the compressed versions of the arrays into a new .hdf5 file, or did you try to overwrite the ones in an existing file? HDF5 has no mechanism for freeing unused space, so if you make a compressed copy of each array within the same file and then delete the originals, your file size can grow to the size of the original file plus the compressed copies of the arrays. In that case you can use h5repack to make a fresh copy of the file and reclaim the unused space (a sketch of that invocation follows these comments).
  • I generated the two files independently, so I don't think that is the issue. In fact, the problem seems to be that the compressed file is mostly metadata :-/ (see the EDIT above).
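For completeness, the h5repack invocation mentioned in the first comment is a one-liner that rewrites the file into a fresh copy (the output filename here is arbitrary):

~$ h5repack res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5 repacked.hdf5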

Tags: python numpy compression hdf5 h5py


【Solution 1】:

First of all, here is a reproducible example:

import h5py
from scipy.misc import lena   # note: lena() was removed in newer SciPy releases; any compressible 2-D array works

img = lena()    # some compressible image data

f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()

Now let's look at the resulting file sizes:

~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
  File metadata: 1304 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 840 bytes
Total space: 2099296 bytes

~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 302850 bytes
  Unaccounted space: 1816 bytes
Total space: 316434 bytes

~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2098560 bytes
  Unaccounted space: 1816 bytes
Total space: 2112144 bytes

In my example, compression with gzip -9 makes sense: although it costs an extra ~10 kB of metadata, that is far outweighed by a ~1794 kB reduction in the size of the image data (a compression ratio of about 7:1). The net result is a roughly 6.6-fold reduction in total file size.

In your example, however, compression only reduces the size of the raw data by about 16 kB (a compression ratio of about 1.5:1), which is vastly outweighed by the 116 kB increase in the size of the metadata. The reason the metadata increase is so much larger than in my example is probably that your file contains 56 datasets rather than just one.

Even if gzip magically shrank the size of your raw data to zero, you would still end up with a file about 1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger you would start to see some benefit from compressing them. As it stands, your arrays are so small that you are unlikely to gain anything from compression.
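To check where the bytes go in a specific file, a minimal sketch (the filename is just the compressed file from the question; any HDF5 file works) is to walk it with h5py and compare each dataset's logical size with the storage actually allocated for it on disk, which the low-level dataset identifier exposes. Note that this covers only raw data storage; the chunk index that h5stat counts as metadata does not show up here:

import h5py

def report_storage(filename):
    """Print logical vs. allocated on-disk size for every dataset in an HDF5 file."""
    with h5py.File(filename, 'r') as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                logical = obj.size * obj.dtype.itemsize   # uncompressed payload in bytes
                on_disk = obj.id.get_storage_size()       # bytes actually allocated on disk
                print('%-40s %8d -> %8d bytes  chunks=%s' % (name, logical, on_disk, obj.chunks))
        f.visititems(visit)

report_storage('res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5')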


UPDATE:

The reason the compressed version needs so much more metadata actually has nothing to do with the compression itself, but rather with the fact that, in order to use compression filters, the dataset has to be split into fixed-size chunks. Presumably a lot of the extra metadata is used to store the B-tree that is needed to index the chunks:

f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()

f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there 
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()

f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()

And the resulting file sizes:

~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 1816 bytes
Total space: 2110736 bytes

~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 96 bytes
Total space: 2101168 bytes

~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 305051 bytes
  Unaccounted space: 96 bytes
Total space: 309067 bytes

Clearly it is the chunking rather than the compression that incurs the extra metadata, since nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and introducing compression for the chunked version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.

In this example, increasing the chunk size so that the array is stored as a single chunk reduced the amount of metadata by roughly a factor of 3. How much difference this would make in your case will probably depend on how h5py automatically picks chunk sizes for your input datasets. Interestingly, it also resulted in a slightly lower compression ratio, which is not what I would have expected.

Bear in mind that having larger chunks also has some downsides. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset that can be disastrous for performance, but in your case the arrays are so small that it is probably not worth worrying about.
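To make that cost concrete, reading a single element of the single-chunk compressed file created above forces the whole (512, 512) chunk to be decompressed internally, even though only one value is returned (a small sketch using the file from the earlier example):

import h5py

with h5py.File('complevel_9_onechunk.h5', 'r') as f:
    # this one-element read decompresses the entire 512x512 chunk behind the scenes
    value = f['img'][200, 300]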

Another thing you should consider is whether you can store your data within a single array rather than in lots of small arrays. For example, if you have K 2D arrays of the same dtype, each with dimensions MxN, then you could store them more efficiently in a single KxMxN 3D array rather than as lots of small datasets. I don't know enough about your data to tell whether that is feasible.
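As a minimal sketch of that idea (the names and shapes here are made up, assuming K arrays of identical shape and dtype):

import numpy as np
import h5py

# suppose we have K small 2D arrays, all MxN with the same dtype
frames = [np.random.rand(64, 48) for _ in range(50)]   # K=50, M=64, N=48

with h5py.File('stacked.h5', 'w') as f:
    stacked = np.stack(frames)                # shape (K, M, N)
    f.create_dataset('frames', data=stacked,
                      compression='gzip', compression_opts=9)

A single chunked dataset like this needs only one set of chunk-index metadata instead of one per array, which is exactly the overhead identified above.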

【Discussion】:

  • You are probably right... although it seems to me that the overhead (metadata) of adding compression is very high. I just ran an example that produces larger files (by adding more datasets of the same size), and in that case the compressed file has about 3M of metadata versus about 1.6M of raw data. I naively assumed that adding compression was equivalent to applying gzip to each piece of data before saving it (with the metadata just saying "algorithm=gzip")... apparently it is not that simple.
  • See my update - the underlying issue is the chunking, not the compression itself.