Efficiency of Fortran stream access vs. MPI-IO

【Question title】: Efficiency of Fortran stream access vs. MPI-IO
【Posted】: 2020-12-23 23:30:44
【Question】:

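I have a parallel section of code in which I write out n large arrays (representing a numerical mesh) in blocks; these are later read back in blocks of a different size. To do this I used stream access, so that each processor writes its block independently. However, when testing this section with 2 processor groups I see inconsistent timings, ranging from 0.5 to 4 seconds.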

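I know you can do something similar with MPI-IO, but I am unsure what the benefit would be, since no synchronization is needed. I would like to know whether there is a way to improve the performance of my writes, or whether there is a reason MPI-IO would be a better choice for this section.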

Here is a sample of the section of code where I create the files and write the norb arrays using two groups (mygroup = 0 or 1):

do irbsic=1,norb
  [various operations]

  blocksize=int(nmsh_tot/ngroups)
  OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
  mypos = 1 + (IRBSIC-1)*nmsh_tot*8     ! starting point for writing IRBSIC
  mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
  WRITE(iunit,POS=mypos) POT(1:nmsh)  
  CLOSE(iunit)

  OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
  mypos = 1 + (IRBSIC-1)*nmsh_tot*8     ! starting point for writing IRBSIC
  mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
  WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
  CLOSE(iunit)

  [various operations]
end do

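【Comments】:

Are multiple processes writing to a given file? If so, I strongly recommend MPI I/O. If you don't use it you may get incorrect results, which is a nasty problem I have run into myself.
If you are writing to different files, which means you have different unit numbers, then you can use ASYNCHRONOUS="YES". Your program will not wait for the I/O to complete, because it has handed the I/O over to the operating system, and you are then limited by the file system.
By the way, why compute mypos twice? Also, should IRBSIC be the do-loop index iorbsrc?
@IanBush Yes, multiple processes write to one file, but each process writes to a different part of the file. Is there still a conflict when they open the same file?
If multiple processes are writing to a file, Fortran I/O is not guaranteed to work. This is not just a theoretical violation of the standard: I have seen it fail, with parts of the resulting file filled with unreadable values. To quote a Cray engineer: "The only sensible, portable way for multiple processes to write to a file is via MPI I/O".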
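For the separate case raised in the comments, where each process writes to its own file, the ASYNCHRONOUS="YES" suggestion could look roughly like the sketch below. The file name ZPOT_rank0, the array size and the variable names are hypothetical, and 8-byte reals are assumed. The Fortran standard allows a compiler to carry out an asynchronous transfer synchronously, so whether the write actually overlaps with computation depends on the compiler and runtime.

```fortran
program async_write
  implicit none

  integer, parameter :: dp = kind( 1.0d0 )
  integer, parameter :: n = 1000          ! illustrative size only

  ! Data in a pending asynchronous transfer needs the ASYNCHRONOUS attribute
  real( dp ), asynchronous :: pot( n )
  integer :: iunit, req, i

  pot = [ ( real( i, dp ), i = 1, n ) ]

  ! Hypothetical per-process file name; each process would use its own file
  open( newunit = iunit, file = 'ZPOT_rank0', status = 'replace', &
        access = 'stream', asynchronous = 'yes' )

  ! Start the write; the program may continue before it completes
  write( iunit, asynchronous = 'yes', id = req ) pot

  ! ... other work that does not modify pot could go here ...

  ! Block until the pending write has finished before reusing pot
  wait( iunit, id = req )
  close( iunit )

end program async_write
```

This does not address several processes writing to one shared file, which is the situation in the question.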

Tags: io fortran mpi mpi-io

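【Solution 1】:

(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work when the file is accessed by a single process, and in my own work I have seen random corruption of files when multiple processes try to write to a file at once, even when the processes are writing to different parts of the file. MPI-I/O, or a library that uses MPI-I/O such as HDF5 or NetCDF, is the only sensible way to achieve this. Below is a simple program illustrating the use of mpi_file_write_at_all: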

ian@eris:~/work/stack$ cat at.f90
Program write_at

  Use mpi

  Implicit None

  Integer, Parameter :: n = 4

  Real, Dimension( 1:n ) :: a

  Real, Dimension( : ), Allocatable :: all_of_a
  
  Integer :: me, nproc
  Integer :: handle
  Integer :: i
  Integer :: error
  
  ! Set up MPI
  Call mpi_init( error )
  Call mpi_comm_size( mpi_comm_world, nproc, error )
  Call mpi_comm_rank( mpi_comm_world, me   , error )

  ! Provide some data
  a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]

  ! Open the file
  Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
       mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )

  ! Describe how the processes will view the file - in this case
  ! simply a stream of mpi_real
  Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
       mpi_real, mpi_real, 'native', &
       mpi_info_null, error )

  ! Write the data using a collective routine - generally the most efficient
  ! but as collective all processes within the communicator must call the routine
  Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
       a, Size( a ), mpi_real, mpi_status_ignore, error )

  ! Close the file
  Call mpi_file_close( handle, error )

  ! Read the file on rank zero using Fortran to check the data
  If( me == 0 ) Then
     Open( 10, file = 'stuff.dat', access = 'stream' )
     Allocate( all_of_a( 1:n * nproc ) )
     Read( 10, pos = 1 ) all_of_a
     Write( *, * ) all_of_a
  End If

  ! Shut down MPI
  Call mpi_finalize( error )
  
End Program write_at
ian@eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90 
ian@eris:~/work/stack$ mpirun -np 2 ./a.out 
   0.00000000       1.00000000       2.00000000       3.00000000       4.00000000       5.00000000       6.00000000       7.00000000    
ian@eris:~/work/stack$ mpirun -np 5 ./a.out 
   0.00000000       1.00000000       2.00000000       3.00000000       4.00000000       5.00000000       6.00000000       7.00000000       8.00000000       9.00000000       10.0000000       11.0000000       12.0000000       13.0000000       14.0000000       15.0000000       16.0000000       17.0000000       18.0000000       19.0000000    
ian@eris:~/work/stack$
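One detail to watch when moving the question's code over to this approach is the offset convention. POS= in stream access is a 1-based byte position, whereas the offset passed to mpi_file_write_at_all is 0-based and counted in units of the etype given to mpi_file_set_view. The following plain Fortran sketch (no MPI needed) shows the conversion for the layout in the question. The sizes are made up for illustration and 8-byte reals are assumed, as implied by the factor of 8 in the question.

```fortran
program offsets
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none

  ! Illustrative sizes only; the real values come from the application
  integer, parameter :: norb = 3, nmsh_tot = 1000, ngroups = 2
  integer, parameter :: bytes_per_real = 8   ! assumed, as in the question

  integer          :: irbsic, mygroup, blocksize
  integer( int64 ) :: elem_offset, mypos

  blocksize = nmsh_tot / ngroups

  do irbsic = 1, norb
     do mygroup = 0, ngroups - 1
        ! 0-based offset counted in array elements, the form MPI-IO
        ! expects once the etype of the view is an 8-byte real
        elem_offset = int( irbsic - 1, int64 ) * nmsh_tot &
                    + int( mygroup, int64 ) * blocksize

        ! 1-based byte position, the form used with POS= in stream access
        mypos = 1_int64 + bytes_per_real * elem_offset

        ! Cross-check against the arithmetic used in the question
        if ( mypos /= 1 + ( irbsic - 1 ) * nmsh_tot * 8 &
                        + mygroup * ( 8 * blocksize ) ) error stop 'mismatch'

        write( *, '(4(a,i0))' ) 'irbsic=', irbsic, ' group=', mygroup, &
             ' elem_offset=', elem_offset, ' pos=', mypos
     end do
  end do

end program offsets
```

The offsets are held in 64-bit integers here because byte positions in large files can exceed the range of a default integer.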

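【Discussion】:

Thanks for the explanation. It looks like this is the route I will end up taking. One question: is mpi_file_set_view a blocking operation? In my use case the processes will arrive at different times. I see there are non-blocking versions of the write (mpi_file_iwrite_at), but I don't know how to handle set_view. I could open and close the file outside the "irbsic" loop, but it looks like mpi_file_set_view needs the offset and so would have to be inside the loop.
From section 13.3 of the MPI standard, mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf: "MPI_FILE_SET_VIEW is collective". Note that mpi_file_open and mpi_file_close are also collective routines. But given what you have above I don't think this is a big problem: just use 0 as the displacement in the view on all procs, use global offsets into the file when writing, and open, set the view and close once outside the main calculation. All of that should be fine, as far as I can tell.
Also, if you are happy with the answer, please mark it as correct. It's not just the reputation I'm after: it also shows others that the question does not need much more attention.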
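The advice in the comments above (open once, set the view once with a displacement of zero, write with global offsets, close once) might be sketched as follows. This is an illustration under stated assumptions, not the original poster's code: it assumes one MPI rank per mesh group, double precision data, and made-up values for norb and nmsh. It uses mpi_file_write_at, the independent (non-collective) counterpart of the routine in the answer, so that ranks reaching the loop at different times do not have to wait for each other at each write.

```fortran
Program write_loop

  Use mpi

  Implicit None

  Integer, Parameter :: dp = Kind( 1.0d0 )

  ! Illustrative sizes; hypothetical stand-ins for the application values
  Integer, Parameter :: norb = 3
  Integer, Parameter :: nmsh = 4      ! mesh points owned by each rank

  Real( dp ), Dimension( 1:nmsh ) :: pot

  Integer :: me, nproc, nmsh_tot
  Integer :: handle, error
  Integer :: irbsic, i
  Integer( mpi_offset_kind ) :: offset

  Call mpi_init( error )
  Call mpi_comm_size( mpi_comm_world, nproc, error )
  Call mpi_comm_rank( mpi_comm_world, me   , error )

  nmsh_tot = nmsh * nproc

  ! Collective: open the file once, before the loop
  Call mpi_file_open( mpi_comm_world, 'ZPOT', &
       mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )

  ! Collective: set the view once, with displacement zero on every rank,
  ! so that offsets are counted in double precision elements
  Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
       mpi_double_precision, mpi_double_precision, 'native', &
       mpi_info_null, error )

  Do irbsic = 1, norb

     ! Global, 0-based element offset of this rank's block of array irbsic
     offset = Int( irbsic - 1, mpi_offset_kind ) * nmsh_tot &
            + Int( me, mpi_offset_kind ) * nmsh

     ! Dummy data: each element holds its own global index
     pot = [ ( Real( offset + i, dp ), i = 0, nmsh - 1 ) ]

     ! Independent write: no other rank needs to take part in this call
     Call mpi_file_write_at( handle, offset, pot, nmsh, &
          mpi_double_precision, mpi_status_ignore, error )

  End Do

  ! Collective: close the file once, after the loop
  Call mpi_file_close( handle, error )

  Call mpi_finalize( error )

End Program write_loop
```

If all ranks do reach the write together, swapping in the collective mpi_file_write_at_all, as in the answer, is the variant the answer describes as generally the most efficient.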