【Question title】: Dask: Read hdf5 and write to other hdf5 file
【Posted】: 2022-07-07 20:24:12
【Question】:

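I am working with an hdf5 file that is larger than memory, so I am trying to use dask to modify it. My goal is to load the file, apply some modifications (not necessarily preserving the shape), and save the result to another file. I create my file like this: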

import h5py as h5
import numpy as np

source_file = "source.hdf5"
x = np.zeros((3, 3))  # In practice, x will be larger than memory
with h5.File(source_file, "w") as f:
    f.create_dataset("/x", data=x, compression="gzip")

Then I use the code below to load, modify, and save it.

from dask import array as da
import h5py as h5
from dask.distributed import Client


if __name__ == "__main__":
    dask_client = Client(n_workers=1)  # No need to parallelize, just interested in dask for memory-purposes

    source_file = "source.hdf5"
    temp_filename = "target.hdf5"

    # Load the dataset as a lazy dask array
    f = h5.File(source_file, "r")
    x_da = da.from_array(f["/x"])

    # Do some modifications
    x_da = x_da * 2

    # Save to target
    x_da.to_hdf5(temp_filename, "/x", compression="gzip")

    # Close original file
    f.close()

However, this produces the following error:

TypeError: ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')
distributed.comm.utils - ERROR - ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')

Am I doing something wrong, or is this simply not possible? If so, is there a workaround?

Thanks in advance!
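
For context, the error message says that an h5py `Dataset` could not be serialized. A distributed scheduler has to ship tasks to worker processes, and an open h5py dataset handle does not appear to survive that step. The sketch below is only an illustration of this point using plain `pickle`; the helper names `can_pickle` and `dataset_is_picklable` are hypothetical and not part of dask or h5py.

```python
import os
import pickle
import tempfile

import h5py as h5
import numpy as np


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def dataset_is_picklable():
    """Create a tiny hdf5 file and check whether its dataset handle can be pickled."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "source.hdf5")
        with h5.File(path, "w") as f:
            f.create_dataset("/x", data=np.zeros((3, 3)))
        with h5.File(path, "r") as f:
            # f["/x"] is a handle to on-disk data, not the data itself
            return can_pickle(f["/x"])
```

Comparing `dataset_is_picklable()` with `can_pickle(np.zeros((3, 3)))` shows the difference between the file-backed handle and an in-memory array.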

【Comments】:

    Tags: python dask hdf5 h5py


    【Solution 1】:

    For anyone interested: I created a workaround that simply calls compute() on each block. Just sharing it, although I am still interested in a better solution.

    from itertools import product

    import h5py as h5


    def to_hdf5(x, filename, datapath):
        """
        Appends dask array to hdf5 file
        """
        with h5.File(filename, "a") as f:
            dset = f.require_dataset(datapath, shape=x.shape, dtype=x.dtype)

            # Visit every block index, e.g. (0, 0), (0, 1), ... for a 2-D array
            for block_ids in product(*[range(num) for num in x.numblocks]):
                # Start offset of this block along each dimension
                pos = [sum(x.chunks[dim][0 : block_ids[dim]]) for dim in range(len(block_ids))]
                block = x.blocks[block_ids]
                slices = tuple(slice(pos[i], pos[i] + block.shape[i]) for i in range(len(block_ids)))
                # Only one block is materialized in memory at a time
                dset[slices] = block.compute()
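
A possible alternative, offered only as a sketch: since the question notes that parallelism is not needed, the same pipeline can be run without a `dask.distributed` `Client`, using dask's single-threaded `"synchronous"` scheduler so that the h5py dataset stays in the calling process and is never serialized. The function name `scale_hdf5` and its `factor` parameter are hypothetical; `da.from_array`, `Array.to_hdf5`, and `dask.config.set` are existing dask calls. I have not tested this on a file that is actually larger than memory.

```python
import dask
import dask.array as da
import h5py as h5


def scale_hdf5(source_file, target_file, datapath="/x", factor=2):
    """Read `datapath` from source_file, multiply by `factor`, write to target_file."""
    with h5.File(source_file, "r") as f:
        # Wrap the on-disk dataset lazily; nothing is loaded yet
        x_da = da.from_array(f[datapath], chunks="auto")

        # Same modification as in the question (for factor=2)
        x_da = x_da * factor

        # The synchronous scheduler runs every task in the calling thread,
        # so the open h5py dataset never has to be sent to another process
        with dask.config.set(scheduler="synchronous"):
            x_da.to_hdf5(target_file, datapath, compression="gzip")
```

Calling `scale_hdf5("source.hdf5", "target.hdf5")` should then write the doubled array to `/x` in the target file. The source file has to stay open until `to_hdf5` returns, which is why the write happens inside the `with` block.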
    

    【Comments】:
