I tried three implementations and compared their performance:

- A: following @RobertCrovella's example, give every block an exclusive partition of global memory to update with atomics, then reduce the partitions between blocks.
- B: have every thread atomically write directly to global memory.
- C: have threads write to block-local shared memory, and finally reduce the shared memory atomically into global memory.

The best solution, in terms of both performance and size restrictions, was surprisingly the simplest, B. I present pseudocode for these solutions in order of complexity, and describe their tradeoffs.
B
This method has every thread atomically write to global memory, directly into the final output. It imposes no additional restrictions on the sizes of out and in, beyond requiring that they fit in global memory.
__global__ void myfunc_kernel(double* in, int inLen, double* out, ...) {

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // atomically add directly to 'out' in global memory (vs all other threads)
    atomicAdd(&out[outInd], in[inInd]);
}
void myfunc(double* in, int inLen, double* out, ...) {

    // copy 'in' (and unspecified additional args) into device memory
    double* d_in = ...

    // create global memory for the final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);

    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);
    myfunc_kernel<<<numBlocks, numThreadsPerBlock>>>(d_in, inLen, d_out, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
C
This method gives every block shared memory, into which the block's threads locally reduce with atomics. Thereafter, the per-block shared memory is atomically reduced between blocks into the output global memory. The block-local out must fit into shared memory, which restricts outLen to at most (say) 4096 (depending on compute capability).
__global__ void myfunc_kernel(double* in, int inLen, double* out, int outLen, ...) {

    // each block has a local copy of 'out', in shared memory
    extern __shared__ double outBlock[];

    // efficiently initialise shared memory to zero
    // (looped in case there are fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        outBlock[i] = 0;
    __syncthreads();

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;

    // guard rather than return early, so that every thread in the
    // block reaches the __syncthreads() below
    if (inInd < inLen) {

        // determine 'outInd' from other args
        int outInd = ...

        // atomically contribute thread's 'in' value to block shared memory
        // (vs other threads in block)
        atomicAdd(&outBlock[outInd], in[inInd]);
    }
    __syncthreads();

    // one (or fewer) thread(s) per output element in each block
    // atomically reduces shared memory into global memory
    // (looped in case there are fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        // (vs all other blocks)
        atomicAdd(&out[i], outBlock[i]);

    /* This final reduction may be possible in parallel/tiers,
     * though documentation & examples about reducing shared memory
     * between blocks are disappointingly scarce
     */
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {

    // if 'out' cannot fit in each block's shared memory, error and exit
    if (outLen > 4096)
        ...

    // copy 'in' (and unspecified additional args) into device memory
    double* d_in = ...

    // create global memory for the final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);

    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // allocate shared memory to fit 'out' in each block
    size_t mem = outLen * sizeof(double);
    myfunc_kernel<<<numBlocks, numThreadsPerBlock, mem>>>(d_in, inLen, d_out, outLen, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
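Incidentally, the 4096 limit need not be hard-coded: the per-block shared-memory capacity can be queried at runtime. A minimal sketch (assuming device 0):

// query how many doubles fit in one block's shared memory
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t maxOutLen = prop.sharedMemPerBlock / sizeof(double);
if ((size_t) outLen > maxOutLen)
    ... // error and exit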
Note that everything is ultimately reduced atomically. This shouldn't be necessary, since after the shared-memory stage completes, the reduction into global memory has a rigid structure. Yet I was unable to find any documentation, or a single demonstration online, of how to collectively reduce shared-memory arrays into global memory.
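For what it's worth, here is a minimal, untested sketch of how that final stage could avoid atomics, at the cost of reintroducing method A's memory overhead. The hypothetical outStage is a global array of length numBlocks*outLen; after every block copies its outBlock into its exclusive slice, a second kernel (like method A's myfunc_reduce_kernel below) sums the slices.

// replaces the final atomicAdd loop of method C's kernel;
// 'outStage' (hypothetical) has length gridDim.x*outLen in global memory
for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
    outStage[blockIdx.x*outLen + i] = outBlock[i]; // exclusive slice, no contention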
A

This method uses the same paradigm as this histogram example. It creates #numBlocks copies of out in global memory, so that every block writes to exclusive memory. Threads within a block write atomically to their block's partition of global memory. Thereafter, a second kernel reduces the partitions into the final out array. This significantly restricts outLen, to less than deviceGlobalMem/numBlocks.
__global__ void myfunc_populate_kernel(double* in, int inLen, double* outs, int outLen, ...) {

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // map to this block's exclusive subarray in global memory
    int blockOutInd = blockIdx.x*outLen + outInd;

    // atomically contribute thread's 'in' value to block's global memory
    // (vs other threads in the same block only)
    atomicAdd(&outs[blockOutInd], in[inInd]);
}
__global__ void myfunc_reduce_kernel(double* out, double* outs, int outLen, int numOuts) {

    // each thread handles one element of 'out'
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;

    // and determines it by reducing the previous per-block partitions
    double total = 0;
    for (int n=0; n<numOuts; n++)
        total += outs[outInd + n*outLen];
    out[outInd] = total;
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {

    int numThreadsPerBlock = 128;
    int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // if #numPopBlocks copies of 'out' cannot fit in global memory, error and exit
    if (outLen * numPopBlocks > ...)
        ...

    // copy 'in' (and unspecified additional args) into device memory
    double* d_in = ...

    // create a working copy of 'out' for each block, in global memory (init to zero)
    size_t mem = outLen * numPopBlocks * sizeof(double);
    double* d_outs;
    cudaMalloc(&d_outs, mem);
    cudaMemset(d_outs, 0, mem);

    // populate each block's partition of global memory 'd_outs'
    myfunc_populate_kernel<<<numPopBlocks, numThreadsPerBlock>>>(d_in, inLen, d_outs, outLen, ...);

    // create global memory for the final output
    double* d_out = ...

    // reduce each block's partition of 'd_outs' into the output 'd_out'
    int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
    myfunc_reduce_kernel<<<numReduceBlocks, numThreadsPerBlock>>>(d_out, d_outs, outLen, numPopBlocks);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
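The elided capacity check above can be made concrete with cudaMemGetInfo; a minimal sketch:

// ensure #numPopBlocks working copies of 'out' fit in free global memory
size_t freeMem, totalMem;
cudaMemGetInfo(&freeMem, &totalMem);
if (outLen * (size_t) numPopBlocks * sizeof(double) > freeMem)
    ... // error and exit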
Comparison

I tested these methods on a Quadro P6000 (24GB, 6.1 CC), with inLen = 2^5, 2^10, 2^15, 2^20, 2^25 and, for each, every outLen = 2^5, 2^10, 2^15, 2^20 satisfying outLen <= inLen. Here is a table of the runtimes:
╔══════════╦═════════╦══════════╦══════════╦══════════╗
║  inLen   ║ outLen  ║    A     ║    B     ║    C     ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║       32 ║      32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║     1024 ║      32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║     1024 ║    1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║    32768 ║      32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║    32768 ║    1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║    32768 ║   32768 ║ 0.000762 ║ 0.000111 ║   N/A    ║
║  1048576 ║      32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║  1048576 ║    1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║  1048576 ║   32768 ║ 0.017547 ║ 0.000899 ║   N/A    ║
║  1048576 ║ 1048576 ║   N/A    ║ 0.006092 ║   N/A    ║
║ 33554432 ║      32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║    1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║   32768 ║ 0.013734 ║ 0.011915 ║   N/A    ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝
Despite being the least restrictive method, B is the clear winner, as the runtime ratios below show:
╔══════════════╦══════╦═════╗
║ inLen/outLen ║ A/B  ║ C/B ║
╠══════════════╬══════╬═════╣
║            1 ║  0.9 ║ 0.5 ║
║           32 ║  5.5 ║ 1.6 ║
║            1 ║  6.2 ║ 1.5 ║
║         1024 ║  3.8 ║ 1.0 ║
║           32 ║  4.9 ║ 1.6 ║
║            1 ║  6.9 ║     ║
║        32768 ║  1.2 ║ 0.3 ║
║         1024 ║  4.0 ║ 6.7 ║
║           32 ║ 19.5 ║     ║
║            1 ║      ║     ║
║      1048576 ║  0.9 ║ 0.4 ║
║        32768 ║  4.0 ║ 1.7 ║
║         1024 ║  1.2 ║     ║
╚══════════════╩══════╩═════╝
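(For anyone reproducing these timings: a minimal sketch of the standard CUDA-event timing pattern, using method B's launch for illustration; the exact harness is not shown here.)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myfunc_kernel<<<numBlocks, numThreadsPerBlock>>>(d_in, inLen, d_out);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop); // elapsed kernel time in milliseconds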
Gotchas

It is hard to stay polite about the state of the CUDA documentation and of the examples available online. So I will not comment, and instead offer the confused reader these gotcha warnings:

- Note that while shared memory appears to be initialised to zero the first time a kernel executes, it is not re-initialised by subsequent executions. You must therefore re-initialise it manually inside the kernel (there is no convenient CUDA function for this), as done at the top of method C's kernel.
- atomicAdd is missing an overload for double. The doc claims this is only so for compute capabilities less than 60, yet my code would not compile for my 6.2 CC GPU without a manual overload. Furthermore, the doc's #if __CUDA_ARCH__ < 600 code snippet is wrong, since __CUDA_ARCH__ is not defined during the host compilation pass. See the correct macro here, and a sketch of the overload after this list.
- Note that the popular parallel-reduction paradigms you will find online assume that all the data is in global memory, or in a single block's shared memory. Reducing shared memory into a global structure appears to be esoteric.
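For reference, here is the manual double overload, adapted from the snippet in the CUDA C Programming Guide, but with the corrected guard that also hides the overload from the host compilation pass (where __CUDA_ARCH__ is undefined):

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
            __double_as_longlong(val + __longlong_as_double(assumed)));
        // note: integer comparison avoids hanging on NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif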
Determining outInd

To give interested readers a better understanding of the out access pattern, here is how outInd is determined from bitInds in each thread (of which there are inLen, across all blocks combined). Note the actual index types are long long int, since both inLen and outLen can be as large as 2^30 = 1073741824.
__global__ void myfunc_kernel(
    double* in, int inLen, double* out, int* bitInds, int numBits
) {
    long long int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine the number encoded by bits {bitInds} of inInd
    long long int outInd = 0;
    for (int b=0; b<numBits; b++) {
        int bit = (inInd >> bitInds[b]) & 1;
        outInd += bit * (1LL << b);
    }
    ...
}
For some possible inputs of bitInds, the access pattern is simple. For example:

- If bitInds = {0, 1, ..., n-1} (contiguously increasing from 0), then outInd = inInd % 2^n, and our kernel performs a (suboptimal) reduction of contiguous partitions of in.
- If bitInds = {log2(inLen)-n, ..., log2(inLen)-1} (the highest n bits), then outInd = floor(inInd / 2^(log2(inLen) - n)), or simply outInd = floor(inInd * 2^n / inLen).

However, if bitInds is an arbitrarily ordered subset of the indices {0, ..., log2(inLen)-1}, then outInd must be determined from each bit individually, as performed above. A toy illustration of the first case follows.
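As a sanity check of the first case, here is a small host-only snippet (with made-up values n = 2, inLen = 16) confirming that the kernel's bit-by-bit loop reproduces outInd = inInd % 2^n when bitInds = {0, 1, ..., n-1}:

#include <cassert>

int main() {
    int bitInds[] = {0, 1}; // n = 2 contiguous bits from 0
    int numBits = 2;

    for (long long int inInd = 0; inInd < 16; inInd++) {
        // same bit-by-bit logic as the kernel above
        long long int outInd = 0;
        for (int b = 0; b < numBits; b++) {
            int bit = (inInd >> bitInds[b]) & 1;
            outInd += bit * (1LL << b);
        }
        assert(outInd == inInd % 4); // outInd = inInd % 2^n
    }
    return 0;
}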