【Question title】: PyTorch "NCCL error": unhandled system error, NCCL version 2.4.8
【Posted】: 2020-07-19 08:54:57
【Question】:

I am training my model with PyTorch distributed training. I have two nodes with two GPUs each. On the first node I run:

python train_net.py  --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml  --num-gpu 2  --num-machines 2 --machine-rank 0 --dist-url tcp://192.168.**.***:8000

and on the other node:

python train_net.py  --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml  --num-gpu 2  --num-machines 2 --machine-rank 1 --dist-url tcp://192.168.**.***:8000

But the second node fails with a RuntimeError:

global_rank 3 machine_rank 1 num_gpus_per_machine 2 local_rank 1
global_rank 2 machine_rank 1 num_gpus_per_machine 2 local_rank 0
Traceback (most recent call last):
  File "train_net.py", line 109, in <module>
    args=(args,),
  File "/root/detectron2_repo/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/root/detectron2_repo/detectron2/engine/launch.py", line 72, in _distributed_worker
    comm.synchronize()
  File "/root/detectron2_repo/detectron2/utils/comm.py", line 79, in synchronize
    dist.barrier()
  File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1489, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8

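If I change --machine-rank 1 to --machine-rank 0, no error is reported, but then the training is not actually distributed. Does anyone know why this error occurs?

【Comments】: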

  • Hey, I ran into the same error. How did you solve it?
  • I got a similar error, but with RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096246/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8. How did you solve it?
  • This pytorch.org/docs/stable/… might help.
  • How can I check the NCCL version from the command line?

Tags: python pytorch


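【Answer 1】:

Many factors can cause this problem; see, for example, 1, 2. Adding the lines

import os
os.environ["NCCL_DEBUG"] = "INFO"

at the top of your script makes NCCL log more specific debugging information about what led to the error, which gives you a more useful error message to search for.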
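As a slightly fuller sketch of the same idea, the logging variables can be set in one small helper that runs before the process group is created. The helper name `enable_nccl_debug` is hypothetical, and `NCCL_DEBUG_SUBSYS` is an additional NCCL environment variable that the answer above does not mention; it is included here on the assumption that narrowing the log to the init and network subsystems is useful for a multi-node failure like this one.

```python
import os


def enable_nccl_debug(subsys=None):
    """Turn on verbose NCCL logging for this process and its children.

    Hypothetical helper: call it before torch.distributed is initialized
    (for example at the very top of train_net.py), because NCCL reads
    these variables when its communicators are created.
    """
    os.environ["NCCL_DEBUG"] = "INFO"
    if subsys:
        # Optional filter such as "INIT,NET" to limit which subsystems log.
        os.environ["NCCL_DEBUG_SUBSYS"] = subsys
    # Return the NCCL-related settings so they can be printed or logged.
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


settings = enable_nccl_debug("INIT,NET")
print(settings)
```

The same effect can usually be had without touching the script by prefixing the launch command with NCCL_DEBUG=INFO in the shell on each node. Worker processes started from the launcher inherit the parent's environment, so setting the variable once before launching should be enough.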

【Comments】:

  • How can I check the NCCL version from the command line?
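On the version question, one option is to ask PyTorch which NCCL it was built against, for example with python -c "import torch; print(torch.cuda.nccl.version())". Depending on the PyTorch release, that call returns either a packed integer (such as 2408) or a (major, minor, patch) tuple, so the sketch below normalizes both into a dotted string. The helper name `format_nccl_version` is hypothetical, and the integer decoding assumes NCCL's own version-code scheme (major*1000 + minor*100 + patch up to 2.8, major*10000 + minor*100 + patch afterwards).

```python
def format_nccl_version(raw):
    """Convert the value of torch.cuda.nccl.version() into 'X.Y.Z'.

    Hypothetical helper; accepts either a tuple/list of version parts
    or a packed integer version code.
    """
    if isinstance(raw, (tuple, list)):
        return ".".join(str(part) for part in raw[:3])
    # Packed integer: older NCCL releases use a base of 1000 for the major
    # version, newer ones use 10000 (assumed from NCCL's version macro).
    base = 1000 if raw < 10000 else 10000
    major, rest = divmod(raw, base)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def print_nccl_version():
    """Print the NCCL version PyTorch reports (requires a CUDA build of torch)."""
    import torch  # imported lazily so the formatter works without torch

    print(format_nccl_version(torch.cuda.nccl.version()))
```

Calling print_nccl_version() on each node is a quick way to confirm that both machines report the same NCCL version, which is worth ruling out in a two-node setup.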