Pre 8.x 相当于 CUDA 中的 __reduce_max_sync()答案

【问题标题】：Pre 8.x equivalent of __reduce_max_sync() in CUDAPre 8.x 相当于 CUDA 中的 __reduce_max_sync()
【发布时间】：2021-12-04 08:25:28
【问题描述】：

cuda-memcheck 在执行以下操作的代码中检测到竞争条件：

condition = /*different in each thread*/;
shared int owner[nWarps];
/* ... owner[i] is initialized to blockDim.x+1 */
if(condition) {
    owner[threadIdx.x/32] = threadIdx.x;
}

所以基本上这段代码根据某些条件计算每个经纱的所有者线程。对于某些 warp 可能没有所有者，但对于某些所有者的数量可能超过 1，然后会发生竞争条件，因为多个线程将值分配给同一共享内存区域。

在尝试了文档之后，我认为我需要的可以完成：

const uint32_t mask = __ballot_sync(0xffffffff, condition);
if(mask != 0) {
    const unsigned max_owner = __reduce_max_sync(mask, threadIdx.x);
    if(threadIdx.x == max_owner) {
        // at most 1 thread assigns here per warp
        owner[threadIdx.x/32] = max_owner;
    }
}

但是，我的尝试有两个问题：

我真的不需要找到最大线程 - 如果有 condition==true 的线程，为每个经纱选择任何 1 个线程就足够了
它需要 CUDA 计算能力 8.x，而我需要支持 5.2 计算能力的设备

您能帮我解决以上问题吗？

【问题讨论】：

为什么让比赛条件决定，不合适？如果您将共享内存声明为 volatile 并同步 warp，您可以回读并知道所有者是谁。
@Sebastian , cuda-memcheck --tool racecheck 抱怨此类代码级别为 ERROR。

标签： c++ parallel-processing cuda gpu-warp compute-capability

【解决方案1】：

以下函数似乎可以解决问题：

void SetOwnerThread(int* dest, const bool condition) {
  const uint32_t mask = __ballot_sync(0xffffffff, condition);
  if(!mask) {
    return;
  }
  const uint32_t lowest_bit = mask & -mask;
  const uint32_t my_bit = (1 << (threadIdx.x & 31));
  if(lowest_bit == my_bit) {
    dest = threadIdx.x;
  }
}

【讨论】：