CUDA 减少非连续子阵列答案

【问题标题】：CUDA reduce non-contiguous sub-arraysCUDA 减少非连续子阵列
【发布时间】：2021-06-28 22:08:38
【问题描述】：

我正在为一个库编写一个函数，该函数接受一个包含 2 次幂元素的大型数组（在 GPU 内存中）。此函数必须对不连续的子数组（长度相等，也是 2 的幂）求和，以生成更小（或很少，大小相等）的数组。这是here 描述的 OpenMP 函数的 GPU 版本。

例如，

in[8] = { ... }
out = { in[0]+in[2],  in[1]+in[3],  in[4]+in[6],  in[5]+in[7] }

缩减子数组的元素由用户给定的位索引列表bitInds 确定。它们告知in 元素的索引（想象为位串）如何映射到out 的索引。

例如，如果

bits = { 0, 2 }

那么in index 0 = 000 映射到out[0]，in index 4 = 100 映射到out[2]。

位的长度范围可以从1 到log2(length(in))。 out 的长度为pow(2, length(bits))。这意味着out 可以和in 一样大，或者是一半大，或者是四分之一大，等等。

为此功能准备设备内存，并在 CUDA 内核中执行结构良好的缩减似乎具有挑战性，我不确定如何开始。由于in 被保证非常大，因此重要的线程访问in 本地和顺序。如何有效地对in 的可能不连续子数组执行pow(2,length(bits)) 缩减？

【问题讨论】：

你可以尝试使用原子。
您有推荐的 CUDA atomics 参考资料吗？
您所询问的内容与直方图有一些相似之处，因此this 如果您的输出数组的长度在 4096 或更小的范围内，您可以使用共享内存方法。否则只是常规的全局原子。
除最低位外，全局内存是随机访问的。高速缓存行是 128 字节宽。因此，连续 128 字节的输入尤其是您的输入，而且您的输出也应该由一个内存请求中的扭曲写入（假设 4 个字节/元素）。您的内核可以模板化或自动生成，具体取决于最低 5 位（索引 0 到 31 给出较低的 128 字节）是否是 bitInds 的一部分。读/写时，您可以随机播放元素或使用共享内存。
对于高位：对于大型输出数组，您有许多独立的归约，然后您可以每个线程进行一次归约（无需同步），对于中等数组，您可以在块内归约最后，对于一个小的输出数组，你可以先用现有的方法写一个更大的输出数组的中间结果，然后再减少一些位。当您如此笼统地陈述您的问题时，有很多案例涉及不同的优化策略。超过 128 字节的缓存和局部性无济于事，因为每个元素只读一次。

标签： c++ cuda reduction

【解决方案1】：

我尝试了三种实现并比较了它们的性能。

A：根据@RobertCrovella 的example，在每块独占全局内存上使用atomics，然后在块之间减少。
B：让每个线程直接原子地写入全局内存。
C：让线程写入块本地共享内存，最后以原子方式将它们减少到全局内存。

在性能和尺寸限制方面的最佳解决方案出人意料地是最简单的B。我介绍了伪代码，并按复杂性顺序描述了这些解决方案的权衡取舍。

B

此方法让每个线程原子地写入全局内存，直接写入最终输出。除了将out 和in 放入全局内存之外，它不会对它们的大小施加额外的限制。

__global__ void myfunc_kernel(double* in, inLen, double* out, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically add directly to out global memory (vs all threads)
    atomicAdd(&out[outInd], in[inInd]);
}

void myfunc(double* in, int inLen, double* out,  ...) {

    // copy unspecified additional args into device memory (shared, if fits)
    double d_in = ...

    // create global memory for final output (initialised to zero)
    double d_out = ...
    cudaMemset(d_out, 0, ...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    myfunc_kernel<<numBlocks, numThreadsPerBlock>>(d_in, inLen, d_out, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

C

此方法通过原子分配每个块共享内存，块中的线程在本地减少到这些共享内存。此后，通过原子将块之间的共享内存减少到输出全局内存中。必须将本地 out 放入共享内存中，将 outLen 限制为最多（例如）4096（取决于计算能力）。

__global__ void myfunc_kernel(double* in, int inLen, double* out, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // each block has a local copy of 'out', in shared memory
    extern __shared__ double outBlock[];
    
    // efficiently initialise shared memory to zero
    // (looped in case fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        outBlock[i] = 0;
    __syncthreads();

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically contribute thread's 'in' value to block shared memory
    // (vs other threads in block)
    atomicAdd(&outBlock[outInd], in[inInd]);
    __syncthreads();

    // keep one (or fewer) thread(s) per output element alive in each block
    if (threadIdx.x >= outLen)
        return;

    // remaining threads atomically reduced shared memory into global
    // (looped in-case fewer threads than 'outLen')
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        // (vs all surviving threads)
        atomicAdd(&out[i], outBlock[i]);

    /* This final reduction may be possible in parallel/tiers,
     * though documentation & examples about reducing shared memory
     * between blocks is disappointingly scarce
     */
}

void myFunc(double* in, int inLen, double* out, int outLen, ...) {
    
    // if 'out' cannot fit in each block's shared memory, error and quit
    if (outLen > 4096)
        ...
       
    // copy unspecified additional args into device memory
    double d_in = ...

    // create global memory for final output (initialised to zero)
    double d_out = ...
    cudaMemset(d_out, 0, ...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // allocate shared memory to fit 'out' in each block
    size_t mem = outLen * sizeof(double);
    myfunc_kernel<<numBlocks, numThreadsPerBlock, mem>>(d_in, inLen, d_out, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

请注意，所有归约最终都是以原子方式执行的。这应该不是必需的，因为在共享内存完成后，减少到全局内存具有严格的结构。但是，我无法在网上找到有关如何将共享内存数组共同减少为全局内存的文档或单个演示。

一个

此方法使用与此histogram example 相同的范例。它在全局内存中创建out 的#numBlocks 副本，以便每个块写入独占内存。块内的线程以原子方式写入其块的全局内存分区。此后，第二个内核将这些分区减少为最终的out 数组。这显着限制了outLen，使其小于deviceGlobalMem/numBlocks。

__global__ void myfunc_populate_kernel(double* in, int inLen, double* outs, int outLen, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // map to this block's exclusive subarray in global memory
    int blockOutInd = blockIdx.x*outLen + outInd;
    
    // atomically contribute thread's 'in' value to block's global memory
    // (vs threads in same block)
    atomicAdd(&outs[outInd], in[inInd]);
}

__global__ void myfunc_reduce_kernel(double* out, double* outs, int outLen, int numOuts) {

    // each thread handles one element of 'out'
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;
  
    // and determines it by reducing previous per-block partitions
    double total = 0;
    for (int n=0; n<numOuts; n++)
        total += outs[outInd + n*outLen];

    out[outInd] = total;
}

void myfunc(double* in, int inLen, double* outs, int outLen, ...) {
    
    int numThreadsPerBlock = 128;
    int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);
 
    // if #numPopBlocks copies of 'out' cannot fit in global memory, error and exit
    if (outLen * numPopBlocks > ...)
        ...

    // copy unspecified additional args into device memory (shared, if fits)
    double d_in = ...

    // create a working 'out' for each block, in global memory (init to zero)
    size_t mem = outLen * numPopBlocks * sizeof(double);
    double d_outs = ...
    cudaMalloc(&d_outs, mem);
    cudaMemset(d_out, 0, mem);
    
    // populate each blocks partition of global memory 'd_outs'
    myfunc_populate_kernel<<numPopBlocks, numThreadsPerBlock>>(d_in, inLen, d_outs, outLen, ...);

    // create global memory for final output
    double d_out = ...

    // reduce each blocks partition of global memory into output 'd_out'
    int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
    myfunc_reduce_kernel<<numReduceBlocks, numThreadsPerBlock>>(d_out, d_outs, outLen, numPopBolocks);

    //copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

比较

我使用 Quadro P6000（24GB，6.1 CC）、inLen = 2^5, 2^10, 2^15, 2^20, 2^25 和（对于所有值）outLen = m2^5, 2^10, 2^15 (outLen <= inLen) 测试了这些方法。这是运行时表：

╔══════════╦═════════╦══════════╦══════════╦══════════╗
║  inLen   ║ outLen  ║  A       ║  B       ║  C       ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║       32 ║      32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║     1024 ║      32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║     1024 ║    1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║    32768 ║      32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║    32768 ║    1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║    32768 ║   32768 ║ 0.000762 ║ 0.000111 ║ N/A      ║
║  1048576 ║      32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║  1048576 ║    1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║  1048576 ║   32768 ║ 0.017547 ║ 0.000899 ║ N/A      ║
║  1048576 ║ 1048576 ║ N/A      ║ 0.006092 ║ N/A      ║
║ 33554432 ║      32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║    1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║   32768 ║ 0.013734 ║ 0.011915 ║ N/A      ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝

尽管限制较少，但方法 B 显然是赢家，如下所示：

╔══════════════╦══════╦═════╗
║ outLen/inLen ║ A/B  ║ C/B ║
╠══════════════╬══════╬═════╣
║            1 ║ 0.9  ║ 0.5 ║
║           32 ║ 5.5  ║ 1.6 ║
║            1 ║ 6.2  ║ 1.5 ║
║         1024 ║ 3.8  ║ 1.0 ║
║           32 ║ 4.9  ║ 1.6 ║
║            1 ║ 6.9  ║     ║
║        32768 ║ 1.2  ║ 0.3 ║
║         1024 ║ 4.0  ║ 6.7 ║
║           32 ║ 19.5 ║     ║
║            1 ║      ║     ║
║      1048576 ║ 0.9  ║ 0.4 ║
║        32768 ║ 4.0  ║ 1.7 ║
║         1024 ║ 1.2  ║     ║
╚══════════════╩══════╩═════╝

陷阱

很难对 CUDA 文档和在线示例的状态保持礼貌。所以我不会发表评论，但向困惑的读者提供这些陷阱警告：

请注意，虽然共享内存似乎在内核第一次执行时被初始化为零，但在后续执行中不会重新初始化。因此，您必须在内核中re-initialise manually（没有方便的 CUDA 函数）。
atomicAdd 缺少 double 类型的重载。 doc 声称这仅适用于小于 60 的计算能力，但我的代码在没有手动过载的情况下无法使用我的 62 CC GPU 进行编译。此外，#if __CUDA_ARCH__ < 600 的doc's code-sn-p 是错误的，因为较新的 arch 预编译器没有定义 __CUDA_ARCH__。查看正确的宏here
请注意，您在线查看的流行并行缩减范例将假定所有数据都在全局内存中，或者在单个块的共享内存中。将共享内存简化为全局结构似乎很深奥。

确定 bitInd

为了让感兴趣的读者更好地理解out 访问模式，下面是如何从每个线程中确定bitInd，其中（所有块组合之间）有inLen。注意实际的索引类型是long long int，因为inLen 和outLen 都可以和2^30 = 1073741824 一样大

__global__ void myfunc_kernel(
    double* in, int inLen, double* out, int* bitInds, int numBits
) {

    long long int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine the number encoded by bits {bitInds} of inInd
    long long int outInd = 0;
    for (int b=0; b<numBits; b++) {
        int bit = (inInd >> bitInds[b]) & 1;
        outInd += bit * (1LL << b);
    }

    ...
}

对于bitInds 的一些可能输入，访问模式很简单。例如：

如果bitInds = {0, 1, ..., n-1}（从0 连续增加），那么outInd = inInd % 2^n，我们的内核（次优）执行减少in 的连续分区。
如果是bitInds = {n, n+1, n+2, ..., log2(inLen)-1}，那么outInd = floor(inInd / 2^(log2(inLen) - n))，或者只是outInd = floor(inInd 2^n / inLen)。

但是，如果bitInds 是索引{0, ..., log2(inLen)-1} 的任意排序子集，则必须按照上述执行从每个位确定outInd。

【讨论】：

所有添加都需要对全局内存进行原子操作似乎很浪费。在更新全局内存之前，你不能让线程在本地添加一些元素吗？
这个库将进入的函数不执行任何自动调整，并试图对用户的 GPU 做一些假设。此外，输入的大小（如您所见）呈指数变化，并且所有末端都是现实的用例。因此很难进行这样的优化
我的评论与自动调整或对用户 GPU 的假设无关。至于输入大小 - 没关系。也就是说，如果输入是如此之小以至于被一个元素覆盖，例如，GPU 可以一次保存的每个可调度扭曲 - 那么优化无论哪种方式都无关紧要，因为无论如何它都不会花费很长时间.如果输入大于（例如，数十万个或更多元素），那么我的建议是中肯的。
共享内存即使在第一次执行时也不能保证被初始化。您不能依赖共享内存的初始化，您必须始终自己进行，以确保正确性。我为 cc6.2 编译 double atomicAdd 没有问题。您必须指定要编译的目标。编译器（即nvcc）不会关注您可能拥有或不拥有的任何 GPU。我在尝试帮助您时遇到的一个问题是，我仍然不明白如何选择 outInd。在您的所有示例中，您似乎都忽略了这一点，但这对于实际性能分析显然很重要。
@RobertCrovella 好点，虽然我希望你相信我的访问模式一般不能优化:) 我在我的答案中添加了一个新部分来解释 outInd 是如何计算出来的。关于atomicAdd - 是的，我了解如何提供计算能力