【Posted on】:2021-12-10 02:32:33
【Problem description】:
When running distributed training on 4 A6000 GPUs, I get the following error:
[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout:
WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
I am using the standard NVidia PyTorch docker. Interestingly, training works fine on a small dataset, but on a larger dataset I get this error. So I can confirm that the training code is correct and does work.
There is no actual runtime error, and no other place where I can find the real error message.
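For context on the numbers in the log: `Timeout(ms)=1800000` is 30 minutes, which matches the default collective timeout that `torch.distributed` applies to the NCCL backend. A minimal sketch, assuming the training script calls `init_process_group` directly (the `timeout` parameter does exist in that API, but the exact value needed here is a guess):

```python
from datetime import timedelta

# The log shows Timeout(ms)=1800000; verify that this is the 30-minute
# default collective timeout used by torch.distributed for NCCL.
logged_timeout = timedelta(milliseconds=1_800_000)
print(logged_timeout == timedelta(minutes=30))  # True

# Hypothetical sketch (not run here): raising the timeout when creating
# the process group, e.g. to 2 hours for long broadcasts on large datasets.
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```

This only extends the watchdog window; if one rank genuinely hangs (e.g. ranks disagree on the number of collectives), a longer timeout will not fix the underlying desynchronization.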
【Discussion】:
Tags: pytorch gpu distributed nvidia-docker