【发布时间】:2021-03-25 20:28:40
【问题描述】:
我看到了多个关于以下方面的问题:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
但似乎没有人能帮我解决这个问题:
- https://github.com/pytorch/pytorch/issues/54550
- https://github.com/pytorch/pytorch/issues/47885
- https://github.com/pytorch/pytorch/issues/50921
- https://github.com/pytorch/pytorch/issues/54823
我尝试在每个脚本的开头手动执行torch.cuda.set_device(device)。这似乎对我不起作用。我尝试过不同的 GPU。我试过降级 pytorch 版本和 cuda 版本。 1.6.0、1.7.1、1.8.0 和 cuda 10.2、11.0、11.1 的不同组合。我不确定还能做什么。人们做了什么来解决这个问题?
也许非常相关?
更完整的错误信息:
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
main_distributed()
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
【问题讨论】:
-
既然您有足够的代表来查看已删除的答案,我假设您已经尝试过
export NCCL_IB_DISABLE=1? -
@jodag 刚刚旋转了我为重现此错误而制作的 conda env。是的,它仍然会产生该错误,不幸的是即使使用
export NCCL_IB_DISABLE=1。已删除的答案之一还暗示了一些关于export NCCL_SOCKET_IFNAME=<YOUR_IFACE>的信息,但我不知道这意味着什么或如何获得<YOUR_IFACE>。如果您知道,请告诉我,我也可以尝试,并希望通过答案和确切的工作场景来结束这些问题(让人们知道它何时为我解决了)。 -
我相信如果你在
dist.init_process_group(backend, rank=rank, world_size=world_size)之前运行它,运行torch.cuda.set_device(opts.gpu) # https://github.com/pytorch/pytorch/issues/54550会起作用。到目前为止一切顺利,请参阅github.com/brando90/ultimate-utils/blob/master/…,但我不能保证 100%,但我的最小示例似乎有效。
标签: pytorch distributed-computing distributed-system