Creating extendable datasets in parallel HDF5

[Question title]: Creating extendable datasets in parallel HDF5
[Posted]: 2015-11-13 02:10:55
[Question]:

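I am trying to write data to an HDF5 file in parallel. Each node has its own dataset, which is unique (although all of them are the same size). I want to write them all into separate datasets in a single HDF5 file, in parallel. The catch is that I may later want to overwrite them with datasets of a different size (different from the original size; the datasets on all processors would still be the same size as each other). Does anyone know how to do this?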

(The code depends on boost and Eigen.)

I have code that first opens the file:

boost::mpi::environment env(argc, argv);

// set up the info for HDF5 and MPI
MPI_Comm comm = MPI_COMM_SELF;
MPI_Info info = MPI_INFO_NULL;

// Set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist_id, comm, info);

// declare the file ID
std::string filename = "test.h5";

// create a file
hid_t fileID = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);

// close the property list
H5Pclose(plist_id);

It then creates and writes the datasets:

// get the mpi communicator
unique_ptr<mpi::communicator> worldComm(new mpi::communicator);

const Eigen::VectorXd dataset = worldComm->rank() * Eigen::VectorXd::Ones(3);
const std::string name = "/vector";

// sleep for a bit so the processors are doing something different
sleep(worldComm->rank() * 2.0);

// the sizes of the data set
const hsize_t dimsf[2] = {(hsize_t)dataset.rows(), (hsize_t)dataset.cols()};

// set the maximum size of the data set to be unlimited
const hsize_t maxdim[2] = {H5S_UNLIMITED, H5S_UNLIMITED};

// the size of each chunk --- is there a better way to choose these numbers?
const hsize_t chunkDims[2] = {2, 5};

// create the dataspace for the dataset.
const hid_t filespace = H5Screate_simple(2, dimsf, maxdim); 
assert(filespace>0);

// modify data set creation properties --- enable chunking
const hid_t prop = H5Pcreate(H5P_DATASET_CREATE);
const herr_t status = H5Pset_chunk(prop, 2, chunkDims);

// create the dataset with default properties for each process
std::vector<hid_t> dsetVec(worldComm->size());
for( int i=0; i<worldComm->size(); ++i ) {
  const std::string datasetName = name+"_rank_"+std::to_string(i);

  dsetVec[i] = H5Dcreate2(fileID, datasetName.c_str(), H5T_NATIVE_DOUBLE, filespace, H5P_DEFAULT, prop, H5P_DEFAULT);
}

// Create property list for dataset write.
const hid_t plistID = H5Pcreate(H5P_DATASET_XFER);

// write the data to file
H5Dwrite(dsetVec[worldComm->rank()], H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, plistID, dataset.data());

// close the filespace 
H5Sclose(filespace);

// close the datasets
for( int i=0; i<worldComm->size(); ++i ) {
  H5Dclose(dsetVec[i]);
}

// close the file
H5Fclose(fileID);

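What I expect is four datasets named "/vector_rank_i" (i=0,1,2,3), each of size 3, with values [0, 0, 0], [1, 1, 1], [2, 2, 2], and [3, 3, 3], respectively. What is actually produced is four datasets named "/vector_rank_i" (i=0,1,2,3), each of size 3, but with values [0, 0, 0], [0, 0, 0], [0, 0, 0], and [3, 3, 3].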

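This exact code works perfectly if I do not use chunking. However, since I need to be able to extend the datasets later, that is not ideal. Does anyone know a good workaround?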

[Comments on the question]:

Tags: c++ boost parallel-processing mpi hdf5

[Answer 1]:

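Before I get to your specific code, I would like to know more about why "one dataset per process" is how you chose to decompose the problem. That looks like it will turn into a mess if you ever scale beyond a handful of processes.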
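To make the alternative concrete: one common layout is a single 2-D dataset with one row per rank, where each process selects its own row as a hyperslab. The following is a minimal sketch of the index arithmetic only, assuming that layout. `RankSlab`, `sharedDatasetDims`, and `slabForRank` are hypothetical names, not HDF5 API; in a real program the `start` and `count` values would be converted to `hsize_t` and passed to `H5Sselect_hyperslab`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not part of HDF5): the region of a shared
// 2-D dataset that one MPI rank would select before writing.
struct RankSlab {
    std::array<std::uint64_t, 2> start;  // first element of the region
    std::array<std::uint64_t, 2> count;  // number of elements per dimension
};

// Extent of the shared dataset: one row per rank, valuesPerRank columns.
std::array<std::uint64_t, 2> sharedDatasetDims(int numRanks,
                                               std::uint64_t valuesPerRank)
{
    assert(numRanks > 0);
    return {static_cast<std::uint64_t>(numRanks), valuesPerRank};
}

// Region owned by `rank`: the whole of row `rank`.
RankSlab slabForRank(int rank, int numRanks, std::uint64_t valuesPerRank)
{
    assert(numRanks > 0 && rank >= 0 && rank < numRanks);
    RankSlab slab;
    slab.start = {static_cast<std::uint64_t>(rank), 0};
    slab.count = {1, valuesPerRank};
    return slab;
}
```

With four ranks and three values each, the shared dataset would be 4 x 3 and rank 2 would own the region starting at (2, 0) with count (1, 3). Every rank then works with one dataset handle instead of one per process.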

You are doing parallel I/O on your datasets, and you have enabled MPI-IO but not collective access. That is unlikely to give very good performance at scale.

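Your chunk dims look really small to me. I would make them bigger, but "how big" depends on a lot of factors. Well, see how performance goes with these values. Perhaps it won't be so bad once you turn on collective I/O?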
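As a rough starting point for "how big", one heuristic is to size chunks in bytes rather than in elements. The sketch below assumes a 1 MiB target per chunk (an assumption to tune, not a figure from this answer) and a row-major layout, so it keeps rows as contiguous as possible. `suggestChunkDims` is a hypothetical helper, not part of HDF5.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of HDF5): pick 2-D chunk dimensions whose
// total size does not exceed targetBytes.
//   expectedRows/expectedCols: the extent the dataset is expected to reach
//   elemSize: bytes per element (8 for a double)
//   targetBytes: assumed budget per chunk, 1 MiB by default
std::array<std::uint64_t, 2> suggestChunkDims(std::uint64_t expectedRows,
                                              std::uint64_t expectedCols,
                                              std::size_t elemSize,
                                              std::uint64_t targetBytes = 1u << 20)
{
    assert(expectedRows > 0 && expectedCols > 0);
    assert(elemSize > 0 && targetBytes >= elemSize);

    // Number of elements that fit in the byte budget.
    const std::uint64_t maxElems = targetBytes / elemSize;

    // Fill the fastest-varying (column) dimension first, then add rows
    // until the budget or the expected extent is reached.
    const std::uint64_t chunkCols = std::min(expectedCols, maxElems);
    const std::uint64_t chunkRows =
        std::min(expectedRows,
                 std::max<std::uint64_t>(1, maxElems / chunkCols));

    return {chunkRows, chunkCols};
}
```

The two values would be converted to `hsize_t` and passed to `H5Pset_chunk`. For a dataset expected to grow to 1,000,000 x 100 doubles this gives chunks of 1310 x 100, just under 1 MiB each; whether that is a good size still has to be measured on the actual file system.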

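Those first impressions aside, maybe you are just experimenting with HDF5. I don't know why turning on chunking would leave some datasets empty... unless you are writing to NFS. If you are writing to NFS, well, good luck, buddy, but it's hopeless.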

[Comments]: