【Posted】: 2016-01-04 17:23:55
【Problem description】:
I'm new to C++/CUDA. I tried to implement the parallel algorithm "reduce" so that it can handle any input size and any thread count by recursing over the kernel's output (in the kernel wrapper).
For example, in the top answer to the question Implementing Max Reduce in Cuda, his/her implementation essentially becomes sequential when the thread count is small enough.
However, when I compile and run it, I keep getting a "segmentation fault"..?
>> nvcc -o mycode mycode.cu
>> ./mycode
Segmentation fault.
Compiled on a K40 with CUDA 6.5.
Here is the kernel, basically the same as in the SO post I linked, with the "out of bounds" check:
#include <stdio.h>

/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
    // position and threadId
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // do reduction in global memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            if (pos + s < size) // Handling out of bounds
            {
                d_in[pos] = d_in[pos] + d_in[pos + s];
            }
        }
    }

    // only thread 0 writes the block's result
    if (tid == 0)
    {
        d_out[blockIdx.x] = d_in[pos];
    }
}
Here is the kernel wrapper I mentioned, which handles the case where 1 block does not contain all of the data:
/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, const int size, int num_threads)
{
    // setting up blocks and intermediate result holder
    int num_blocks = ((size) / num_threads) + 1;
    float * d_intermediate;
    cudaMalloc(&d_intermediate, sizeof(float) * num_blocks);

    // recursively solving, will run approximately log base num_threads times.
    do
    {
        reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
        // updating input to intermediate
        cudaMemcpy(d_in, d_intermediate, sizeof(float) * num_blocks, cudaMemcpyDeviceToDevice);
        // Updating num_blocks to reflect how many blocks we now want to compute on
        num_blocks = num_blocks / num_threads + 1;
        // updating intermediate
        cudaMalloc(&d_intermediate, sizeof(float) * num_blocks);
    }
    while (num_blocks > num_threads); // if it is too small, compute rest.

    // computing rest
    reduce_kernel<<<1, num_blocks>>>(d_out, d_in, size);
}
And the main program, which initializes the input/output and creates bogus data for testing:
/* -------- MAIN -------- */
int main(int argc, char **argv)
{
    // Setting num_threads
    int num_threads = 512;

    // Making bogus data and setting it on the GPU
    const int size = 1024;
    const int size_out = 1;
    float * d_in;
    float * d_out;
    cudaMalloc(&d_in, sizeof(float) * size);
    cudaMalloc((void**)&d_out, sizeof(float) * size_out);
    const int value = 5;
    cudaMemset(d_in, value, sizeof(float) * size);

    // Running kernel wrapper
    reduce(d_out, d_in, size, num_threads);
    printf("sum is element is: %.f", d_out[0]);
}
【Comments】:
-
The segmentation fault is occurring in host code, not in the CUDA device code. Good practice when asking about a segmentation fault on SO is to identify the line that causes it (a segfault can always be localized to the specific line of code that actually generates the fault). This localization is easily done with printf statements scattered through the code, or with a debugger.
Tags: c++ algorithm cuda parallel-processing reduce