【Title】: Error: Some NCCL operations have failed or timed out
【Posted】: 2021-12-10 02:32:33
【Question】:

When running distributed training on 4 A6000 GPUs, I get the following error:

[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'
what():  [Rank 2] Watchdog caught collective operation timeout:
WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I am using the standard NVIDIA PyTorch Docker image. Interestingly, training works fine on a small dataset, but with a larger dataset I get this error, so I can confirm that the training code itself is correct and does work.

There is no actual runtime error or any other output anywhere that would reveal the real error message.

【Comments】:

    Tags: pytorch gpu distributed nvidia-docker


    【Solution 1】:

    The following two changes fixed this problem for me:

    • Increase Docker's default SHM (shared memory) size to 10g (I think 1g would probably also work). You can do this by passing --shm-size=10g to the docker run command. I also passed --ulimit memlock=-1.
    • export NCCL_P2P_LEVEL=NVL
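Combining both fixes, a docker run invocation might look like the sketch below; the container image tag is only an example, so substitute whatever NVIDIA PyTorch image you actually use:

```shell
# Sketch: launch the NVIDIA PyTorch container with a larger /dev/shm and
# no locked-memory limit. The image tag below is an example, not a requirement.
docker run --gpus all \
  --shm-size=10g \
  --ulimit memlock=-1 \
  -e NCCL_P2P_LEVEL=NVL \
  -it nvcr.io/nvidia/pytorch:21.10-py3 bash
```

NCCL_P2P_LEVEL=NVL tells NCCL to use peer-to-peer transfers only between GPU pairs connected via NVLink, which sidesteps potentially flaky PCIe P2P paths.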

    Debugging tips

    To check the current SHM size:

    df -h
    # see the row for shm
    
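The same number can also be read programmatically; here is a minimal Python check, assuming a Linux host (or container) where /dev/shm is the shm mount:

```python
import os

def mount_size_gib(path: str) -> float:
    """Total size of the filesystem mounted at `path`, in GiB."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks / 2**30

# /dev/shm is the "shm" row that df -h reports inside the container.
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm: {mount_size_gib('/dev/shm'):.1f} GiB")
```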

    To see NCCL debug messages:

    export NCCL_DEBUG=INFO
    
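NCCL_DEBUG can also be set for a single run instead of exported globally; a hypothetical launch with torchrun (the script name is a placeholder):

```shell
# NCCL_DEBUG=INFO makes each rank log its topology detection and transport
# choices, which usually reveals which collective is hanging.
NCCL_DEBUG=INFO torchrun --nproc_per_node=4 train.py
```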

    Run the p2p (peer-to-peer) bandwidth test for the GPU-to-GPU communication links:

    cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
    sudo make
    ./p2pBandwidthLatencyTest
    

    On a 4-GPU A6000 box, the test prints a matrix showing the bandwidth and P2P status between each pair of GPUs; the bandwidth figures should be high.

    【Comments】:
