CUDA block synchronization: answers

【Question title】: cuda block synchronization
【Posted】: 2011-09-18 07:11:07
【Question】:

I have b blocks, and each block has t threads. I can use

 __syncthreads()

to synchronize the threads within a particular block. For example:

__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        //execute something
        __syncthreads();  // barrier for the threads of this block only
    }
}

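But my problem is synchronizing all the threads in all the blocks. How can I do that?

【Discussion】:

Tags: cuda

【Solution 1】:

CUDA 9 introduced the concept of cooperative groups, which lets you synchronize all threads that belong to a group. Such a group can span every thread in the grid, so you can synchronize all threads in all blocks: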

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

// The two lines below go inside the kernel.
cooperative_groups::grid_group g = cooperative_groups::this_grid();
g.sync();

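You need a Pascal (compute capability 6.0) or newer architecture to synchronize a grid. There are further, more specific requirements as well. See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg

Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.

Source: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/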
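To make the launch side concrete, here is a minimal sketch of a complete cooperative launch. The kernel `reverseCopy`, the host helper `runReverseCopy` and the block size of 128 are made up for illustration; only `this_grid()`, `sync()`, `cudaLaunchCooperativeKernel`, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and the device attributes are CUDA API. It assumes a build for compute capability 6.0 or newer (for example `nvcc -arch=sm_60`); older toolkits may additionally need `-rdc=true`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical two-phase kernel: phase 2 reads elements that other blocks
// wrote in phase 1, so a grid-wide barrier is needed between the phases.
__global__ void reverseCopy(int *buf, int *out, int n) {
    cg::grid_group grid = cg::this_grid();
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += stride) buf[i] = i;               // phase 1
    grid.sync();                                                    // all blocks wait here
    for (int i = tid; i < n; i += stride) out[i] = buf[n - 1 - i];  // phase 2
}

// Returns 0 on success, -1 if cooperative launch is unavailable,
// -2 on any other CUDA error.
int runReverseCopy(int n, std::vector<int> &result) {
    int dev = 0, coop = 0, numSms = 0, blocksPerSm = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) return -1;

    // A cooperative grid must fit on the GPU all at once, so size it
    // from the occupancy of this kernel.
    const int threads = 128;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, reverseCopy, threads, 0);
    const int blocks = numSms * blocksPerSm;

    int *buf = nullptr, *out = nullptr;
    if (cudaMalloc(&buf, n * sizeof(int)) != cudaSuccess) return -2;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(buf); return -2; }

    void *args[] = { &buf, &out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)reverseCopy, dim3(blocks), dim3(threads), args, 0, nullptr);
    result.assign(n, 0);
    if (err == cudaSuccess)
        err = cudaMemcpy(result.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(out);
    return err == cudaSuccess ? 0 : -2;
}

int main() {
    std::vector<int> r;
    const int rc = runReverseCopy(1000, r);
    if (rc == 0) printf("out[0] = %d, out[999] = %d\n", r[0], r[999]);
    else printf("cooperative launch not available or failed (rc = %d)\n", rc);
    return 0;
}
```

The kernel uses a grid-stride loop so that the grid can stay small enough to be fully resident regardless of `n`.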

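Before CUDA 9, there was no native way to synchronize all threads in all blocks. In fact, the concept of blocks in CUDA allows some blocks to be launched only after other blocks have finished their work, for example when the GPU is too weak to run them all in parallel.

If you make sure you do not spawn too many blocks, you can try to synchronize the blocks among themselves, for example by actively waiting with atomic operations. However, this is slow, it eats up your GPU memory controller, it is considered "a hack", and it should be avoided.

So if you are not targeting the Pascal (or a newer) architecture, the best approach I can suggest is to simply end your kernel at the synchronization point and launch a new kernel that continues the work. In most cases this will actually be faster than the hack mentioned above, or at least about as fast.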
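A minimal sketch of the kernel-splitting approach, assuming the work divides cleanly into two phases. The kernels `phase1` and `phase2` and the helper `runSplit` are hypothetical names; the point is only that two launches on the same stream run in order, so the kernel boundary acts as the grid-wide barrier on any architecture and for any number of blocks.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Phase 1: every thread fills its own element.
__global__ void phase1(int *buf, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Phase 2: every thread reads an element that another block may have written.
__global__ void phase2(const int *buf, int *out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[n - 1 - i];
}

// Returns 0 on success, -1 on any CUDA error (including "no device").
int runSplit(int n, std::vector<int> &result) {
    int *buf = nullptr, *out = nullptr;
    if (cudaMalloc(&buf, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(buf); return -1; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    phase1<<<blocks, threads>>>(buf, n);
    // Same (default) stream: phase2 starts only after every block of phase1
    // has finished, which is the synchronization point.
    phase2<<<blocks, threads>>>(buf, out, n);

    result.assign(n, 0);
    // cudaMemcpy waits for the preceding kernels before copying.
    cudaError_t err = cudaMemcpy(result.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(out);
    return err == cudaSuccess ? 0 : -1;
}

int main() {
    std::vector<int> r;
    if (runSplit(100000, r) == 0) printf("out[0] = %d\n", r[0]);
    else printf("CUDA error\n");
    return 0;
}
```

State that has to survive the synchronization point must live in global memory, since registers and shared memory do not persist across kernel launches.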

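【Discussion】:

Yay! Six years after the original answer, a new version of CUDA lets me revisit it and give a more positive solution :)
More specifically: #include #include #include ... cooperative_groups::grid_group grp = cooperative_groups::this_grid(); grp.sync();
@ragerdl Is this code written in the CUDA kernel or in the main program?
@AndreasHadjigeorgiou Both the code in my solution and ragerdl's code are written inside the kernel. Apart from the includes, of course.
Thank you, it works great! Note that you need to define it like this: cooperative_groups::grid_group g = cooperative_groups::this_grid(); and consider reading this, because the kernel should be launched through the cudaLaunchCooperativeKernel API.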

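【Solution 2】:

Cooperative groups come with some requirements, such as having to launch your kernel through cudaLaunchCooperativeKernel. That makes them a poor fit for simple projects.

A simple alternative is to use atomics with a bit field, like this: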

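#include <cstdio>  // device-side printf

// A global var with 64 bits can track 64 blocks,
// use an array if you need to track more blocks.
// The CUDA atomics take unsigned long long, not uint64_t.
__device__ unsigned long long CompleteMask;

//This is where we put in all the smarts
//from the CPU reference solver
__global__ void doWork() {
    // One thread per block clears the mask before the block sets its own bit.
    if (threadIdx.x == 0) { atomicAnd(&CompleteMask, 0ULL); }
    //do lots of work

    // Expected mask: one bit per block (a shift by 64 would be undefined).
    const unsigned long long SollMask =
        (gridDim.x >= 64) ? ~0ULL : (1ULL << gridDim.x) - 1;
    __syncthreads();  // every thread of this block has finished its work
    if (threadIdx.x == 0) {
        __threadfence();  // make this block's global writes visible first
        // atomicOr returns the old value, so the loop ends on the first
        // call made after all bits have been set.
        while (atomicOr(&CompleteMask, 1ULL << blockIdx.x) != SollMask) { /*do nothing*/ }
    }
    __syncthreads();  // hold the other threads until thread 0 leaves the loop
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}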

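Because each block is assigned its own bit in the mask, the blocks never interfere with each other. If you have more blocks than fit in a single word, use an array to track the bits and a second-level mask to track which words are complete, like this (two 32-bit words cover up to 64 blocks; enlarge the array for more):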

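// Each 32-bit word tracks 32 blocks, so two words cover up to 64 blocks;
// enlarge the array if you need to track more blocks.
__device__ unsigned int CompleteMask[2];
__device__ unsigned int CompleteSuperMask;  // one bit per word of CompleteMask

__global__ void doWork() {
    const unsigned int b = blockIdx.x / 32;            // word used by this block
    const unsigned int words = (gridDim.x + 31) / 32;  // number of words in use
    if (threadIdx.x == 0) {
        // Clear only this block's own word and the super mask. Clearing
        // another group's word could erase bits that nobody sets again.
        atomicAnd(&CompleteMask[b], 0u);
        atomicAnd(&CompleteSuperMask, 0u);
    }
    //do lots of work

    // Expected value of this word; the last word may be only partly used.
    const unsigned int remaining = gridDim.x - 32 * b;
    const unsigned int SollMask = (remaining >= 32) ? ~0u : (1u << remaining) - 1;
    const unsigned int SollSuperMask = (1u << words) - 1;

    __syncthreads();  // every thread of this block has finished its work
    if (threadIdx.x == 0) {
        __threadfence();  // make this block's global writes visible first
        while (atomicOr(&CompleteMask[b], 1u << (blockIdx.x % 32)) != SollMask) { /*do nothing*/ }
        while (atomicOr(&CompleteSuperMask, 1u << b) != SollSuperMask) { /*do nothing*/ }
    }
    __syncthreads();  // hold the other threads until thread 0 leaves the loops
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}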
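For comparison, the same kind of busy-wait barrier can also be written with a plain arrival counter instead of bit masks. The following is only a sketch under stated assumptions, not part of the answer above: `arrivedBlocks`, `gridBarrierOnce`, `sumBlockIds` and `runBarrierDemo` are hypothetical names, the barrier can be used once per launch, the host must zero the counter before each launch, and, like any spin barrier, it only terminates when every block of the grid is resident on the GPU at the same time (hence the deliberately tiny grid).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Arrival counter; assumed to be zeroed by the host before every launch.
__device__ unsigned int arrivedBlocks;

// Hypothetical single-use barrier across all blocks of a 1-D grid.
// Only safe when all blocks of the grid are resident at once.
__device__ void gridBarrierOnce() {
    __syncthreads();                    // the whole block has arrived
    if (threadIdx.x == 0) {
        __threadfence();                // publish this block's global writes
        atomicAdd(&arrivedBlocks, 1u);  // count this block in
        // atomicAdd with 0 is used as an atomic read of the counter.
        while (atomicAdd(&arrivedBlocks, 0u) < gridDim.x) { /* spin */ }
        __threadfence();
    }
    __syncthreads();                    // release the rest of the block
}

// Each block publishes one value; after the barrier, block 0 sums them all.
__global__ void sumBlockIds(int *perBlock, int *total) {
    if (threadIdx.x == 0) perBlock[blockIdx.x] = blockIdx.x + 1;
    gridBarrierOnce();
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        const volatile int *p = perBlock;  // read the values from memory
        int s = 0;
        for (unsigned int i = 0; i < gridDim.x; ++i) s += p[i];
        *total = s;
    }
}

// Returns the sum computed on the device, or -1 on any CUDA error.
int runBarrierDemo(int blocks) {
    int *perBlock = nullptr, *total = nullptr;
    if (cudaMalloc(&perBlock, blocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) { cudaFree(perBlock); return -1; }

    const unsigned int zero = 0;
    cudaMemcpyToSymbol(arrivedBlocks, &zero, sizeof(zero));  // reset the counter
    sumBlockIds<<<blocks, 64>>>(perBlock, total);

    int h = -1;
    cudaError_t err = cudaMemcpy(&h, total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(perBlock);
    cudaFree(total);
    return err == cudaSuccess ? h : -1;
}

int main() {
    printf("sum = %d\n", runBarrierDemo(4));  // 4 small blocks: 1 + 2 + 3 + 4
    return 0;
}
```

A counter has no 64-block limit, but it gives up the per-block bit that makes the mask version easy to inspect when a block fails to arrive.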

【Discussion】: