Answers: Julia parallel computing over multiple nodes in a cluster

【Question title】: Julia parallel computing over multiple nodes in cluster
【Posted】: 2017-08-22 02:28:04
【Question】:

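I am running some jobs on a shared cluster and have been trying to use more than one node at a time. julia -p #processors works for the cores on a single node, but it does not find the other nodes. The cluster uses SGE, and I have tried many different ways to get the nodes working, but I can only get one node to work. Is there a simple way to start Julia with something like julia -mpi 32?
Using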

using ClusterManagers
println(nworkers(),nprocs(),Sys.CPU_CORES)
ClusterManagers.addprocs_sge(16)
ClusterManagers.addprocs_sge(15)
println(nworkers(),nprocs(),Sys.CPU_CORES)

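does not work (I submitted a job on SGE that reserves 2 nodes with 16 cores each). The job's output file is empty, and instead I get 16 different output files julia-70755.o8252776.* (* = 1...16), each containing the following text: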

julia_worker:9009#192.168.17.206
Master process (id 1) could not connect within 60.0 seconds.
exiting.

Starting Julia with julia --machinefile $PE_HOSTFILE fails as well:

Warning: Permanently added the RSA host key for IP address '192.168.18.10' to the list of known hosts.
ERROR: connect: invalid argument (EINVAL)
 in uv_error at ./libuv.jl:68 [inlined]
 in connect!(::TCPSocket, ::IPv4, ::UInt16) at ./socket.jl:652
 in connect!(::TCPSocket, ::SubString{String}, ::UInt16) at ./socket.jl:688
 in connect at ./stream.jl:959 [inlined]
 in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
 in connect(::Base.SSHManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
 in create_worker(::Base.SSHManager, ::WorkerConfig) at ./multi.jl:1786
 in setup_launched_worker(::Base.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
 in (::Base.##669#673{Base.SSHManager,Array{Int64,1}})() at ./task.jl:360
 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in #addprocs_locked#665(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1688
 in (::Base.#kw##addprocs_locked)(::Array{Any,1}, ::Base.#addprocs_locked, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#664(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1658
 in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#764(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{Any,1}) at ./managers.jl:112
 in process_options(::Base.JLOptions) at ./client.jl:227
 in _start() at ./client.jl:321
UndefRefError()

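I was advised to use the MPI.jl package, but it does not seem to support the Julia parallel syntax the way I use it, which is to write @sync @parallel in front of the for loop that I want to run in parallel (i.e. Metropolis Monte Carlo).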

【Comments】:

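The first hit on Google gave me (hopefully) the answer to your question: stochasticlifestyle.com/…
Yes, the machinefile approach from my blog post works for this. I show exactly how to do it on an SGE cluster. You may need to find out how your cluster names its machine file... but if you start an MPI job, it will show up.
@ChrisRackauckas Nope, julia --machinefile jbtest-pe_hostfile_mpich.$JOB_ID test.jl does not work; it returns ERROR: SystemError: opening file /usr4/spclpgm/opfeffer/annealing/jbtest-pe_hostfile_mpich.8279159: No such file or directory
By the way: I cannot use #$ -pe mpich 128, because: Unable to run job: job rejected: the requested parallel environment "mpich" does not exist. I use mpi_16_tasks_per_node #NCORES instead.
The error you are getting is because you have the wrong path. Different clusters put the machine file in different places, so you need to find out where yours is and what it is called. Start an MPI job, ssh into a node, and look around for the file. It is most likely in the cwd, just under a different name. The commands I gave generally point in the right direction but have to be adapted to the specifics of your cluster. mpich is another example of that.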
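
The "look around for the file" step can be sketched as a small shell helper run from inside the job. This is only a sketch under assumptions: find_hostfiles is a hypothetical name, and the candidate list contains just the two locations mentioned in this thread ($PE_HOSTFILE and $TMPDIR/machines); other clusters may use different names.

```shell
#!/bin/sh
# Print whichever candidate host files exist for the current job.
# find_hostfiles is a hypothetical helper, not an SGE or Julia command.
# The candidate list is an assumption; extend it with the names your cluster uses.
find_hostfiles() {
    for f in "$PE_HOSTFILE" "$TMPDIR/machines"; do
        # Skip unset candidates and paths that are not regular files.
        if [ -n "$f" ] && [ -f "$f" ]; then
            echo "$f"
        fi
    done
}
```

Calling find_hostfiles at the top of a job script prints the paths that exist on that cluster, one per line, which shows what could be passed to julia --machinefile.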

Tags: parallel-processing cluster-computing julia

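【Solution 1】:

The IT team got back to me and told me that SGE does not allow passwordless ssh, which is why addprocs_sge() does not work. However, they have now added a file to the job that I can pass to Julia, and they told me to run the job with this script: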

qlogin -pe mpi_28_tasks_per_node 56
module load julia/0.5.1
julia --machinefile $TMPDIR/machines

The machine file looks like this:

::::::::::::::
/scratch/8548498.1.u/machines
::::::::::::::
{hostname1}
{hostname1}
...
{hostname2}
{hostname2}
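
On clusters that only expose $PE_HOSTFILE, a small conversion step may help. The sketch below assumes the usual SGE host-file layout (hostname, slot count, queue name, processor range on each line) and expands it into the one-hostname-per-slot layout shown above. pe_hostfile_to_machines and the output file name are hypothetical, and whether this format difference explains the EINVAL error in the question has not been confirmed.

```shell
#!/bin/sh
# Expand an SGE-style host file ("hostname slots queue processor-range" per line)
# into one hostname per slot, matching the machine file layout shown above.
# pe_hostfile_to_machines is a hypothetical helper, not an SGE or Julia command.
pe_hostfile_to_machines() {
    # $1 of each input line is the hostname, $2 the number of slots on that host.
    awk 'NF >= 2 { for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Assumed usage inside a job script (file names are illustrative):
#   pe_hostfile_to_machines "$PE_HOSTFILE" > machines.$JOB_ID
#   julia --machinefile machines.$JOB_ID test.jl
```

Julia starts one worker per line of the machine file, so a host with 16 slots should appear 16 times in the output.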

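【Comments】:

【Solution 2】:

You may want to read the Julia documentation on parallel computing, which has a section on cluster managers. Also take a look at ClusterManagers.jl, which supports SGE: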

julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(5)

【Comments】: