Getting unknown error when launching large kernel sizes

【Question title】: Getting unknown error when launching large kernel sizes
【Posted】: 2018-01-10 05:16:11
【Question description】:

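I am having a problem launching a simple kernel when my array is larger than 591 x 591. At 591x591 the array comes back without any error, but as soon as I launch a grid of 38x38 blocks with 16x16 threads per block, the kernel fails to launch and returns "unknown error".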

The following code is the kernel I am calling, together with the call to it in my code:

#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_device_runtime_api.h>

using namespace std;

#define BLOCKSIZE 16
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__,__LINE__);}

inline void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if(abort) exit(code);
}
}



__global__ void IdentityMatrixKernel(float* identity, int size)
{
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;

// map the two 2D indices to a single linear, 1D index
int grid_width = gridDim.x * blockDim.x;
int index = index_y * grid_width + index_x;

// map the two 2D block indices to a single linear, 1D block index
//int result = blockIdx.y * gridDim.x + blockIdx.x;


if (index % (size+1))
{
    identity[index] = 0;
}
else
{
    identity[index] = 1;
}


void foo(float *aArray, int size)
{
float* d_I;
int size2 = size*size*sizeof(float);

gpuErrchk(cudaMalloc(&d_I,size2));

dim3 block_size;
block_size.x = BLOCKSIZE;
block_size.y = BLOCKSIZE;

dim3 grid_size;
grid_size.x = size1/ block_size.x + 1;
grid_size.y = size1/ block_size.y + 1;

IdentityMatrixKernel<<<grid_size,block_size>>>(d_I,size);
gpuErrchk(cudaPeekAtLastError());

gpuErrchk(cudaMemcpy(aArray,d_I,size2,cudaMemcpyDeviceToHost));

cudaFree(d_I);
}

int main()
{
int size = 591;
float *aArray = (float*)malloc(size*size*sizeof(float));

foo(aArray,size);


return 0;
}

For size = 591 no error is shown and the output is a 591x591 identity matrix, but for any larger size it prints "unknown error" to the console.

【Question comments】:

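I'm guessing this is not the code you are actually running. It has various compile problems. Please check that the code you posted really compiles, fix any issues, and then make sure it actually demonstrates the problem. Then run your code with cuda-memcheck; I think you will see that your kernel produces a lot of errors (for example, out-of-bounds accesses such as "invalid global write of size 4").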
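
The out-of-bounds writes suspected in the comment above can be checked with plain host-side arithmetic, without a GPU. The sketch below is an illustration, not part of the original thread: `LaunchCheck` and `check_original_indexing` are hypothetical names, and the function only reproduces the question's `size / block + 1` grid sizing and its `index_y * grid_width + index_x` formula to find the largest linear index any thread would compute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: summarizes the launch geometry from the question.
struct LaunchCheck {
    std::int64_t grid_blocks;  // blocks per grid dimension
    std::int64_t max_index;    // largest linear index any thread computes
    std::int64_t allocated;    // number of floats allocated (size * size)
};

// Reproduces the question's arithmetic on the host (assumes a square launch).
LaunchCheck check_original_indexing(std::int64_t size, std::int64_t block = 16) {
    assert(size > 0 && block > 0);
    std::int64_t grid_blocks = size / block + 1;      // as in the question
    std::int64_t grid_width = grid_blocks * block;    // gridDim.x * blockDim.x
    std::int64_t last = grid_width - 1;               // largest index_x and index_y
    std::int64_t max_index = last * grid_width + last;
    return {grid_blocks, max_index, size * size};
}
```

For `size = 592` this gives a 38x38 grid and a largest index of 369663, against only 350464 allocated floats. For `size = 591` the largest index (350463) also exceeds the 349281 allocated floats, so under this arithmetic the 591 case writes past the buffer as well; the absence of a reported error there does not show that it is in bounds.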

Tags: c++ cuda

【Solution 1】:

One problem seems to be that you are launching a grid of threads that is larger than the actual matrix:

grid_size.x = size1/ block_size.x + 1;
grid_size.y = size1/ block_size.y + 1;

but you are not checking for out-of-bounds accesses anywhere in the kernel. You need to add a thread check such as:

if ((index_x >= size)||(index_y >= size)) return;

near the beginning of the kernel. But that alone is not enough. Another problem is that your index calculation is incorrect:

int index = index_y * grid_width + index_x;

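On the surface it looks correct, but because your thread array is (possibly) larger than your data array, the row stride grid_width does not match the row length of the data, which can produce wrong indices. Since you are passing size to the kernel, change it to something like this: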

int index = index_y * size + index_x;

With that, you should be able to eliminate the out-of-bounds accesses.
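
The two fixes in this answer can be tried out on the CPU before touching the kernel. The following is a minimal sketch, assuming a square launch: `identity_thread` and `emulate_identity_launch` are hypothetical names that stand in for one kernel thread and for the whole launch, and the nested loops replace the GPU's block and thread indices. It is not the poster's code, only an emulation of the bounds check plus the `index_y * size + index_x` indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for one thread of the corrected kernel.
void identity_thread(std::vector<float>& identity, int size, int index_x, int index_y) {
    // Bounds check from the answer: surplus threads do nothing.
    if (index_x >= size || index_y >= size) return;
    // Row stride is the matrix size, not the width of the thread grid.
    std::size_t index = static_cast<std::size_t>(index_y) * size + index_x;
    // .at() throws std::out_of_range if the index were ever past the buffer.
    identity.at(index) = (index % (static_cast<std::size_t>(size) + 1)) ? 0.0f : 1.0f;
}

// Runs every thread of a (grid x grid) launch of (block x block) threads in a loop.
std::vector<float> emulate_identity_launch(int size, int block = 16) {
    assert(size > 0 && block > 0);
    int grid = size / block + 1;  // grid sizing kept from the question
    // Fill with -1 so any element no thread wrote would be easy to spot.
    std::vector<float> identity(static_cast<std::size_t>(size) * size, -1.0f);
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < block; ++ty)
                for (int tx = 0; tx < block; ++tx)
                    identity_thread(identity, size, bx * block + tx, by * block + ty);
    return identity;
}
```

Calling `emulate_identity_launch(592)` completes without throwing and leaves every element set to 0 or 1, with the ones on the diagonal. The original `index % (size + 1)` test still picks out the diagonal once the row stride equals `size`.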

【Comments】:

Ah, thanks! Yes, I changed the way I index and then removed the "+1" from the grid_size calculation, and it all works.

【Solution 2】:

I am expanding on Robert Crovella's answer.

If you define block_size.{x, y} with a large number (16 in your case), you will not be able to use smaller arrays, e.g. 4x4. What you can do is define a small block size:

/* create thread blocks */
dim3 block_size;
block_size.x = 4;
block_size.y = 4;

/* create n x n block grids */
dim3 grid_size;
grid_size.x = size1/block_size.x;
grid_size.y = size1/block_size.y;

/* in case of partial sizes make grid_size 1 x 1 */
if (size1 % block_size.x)
    grid_size.x = 1, grid_size.y = 1;

【Comments】:

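A thread block of 16 threads is usually a poor choice. The thread check I mentioned can be used to handle small arrays.
Sorry for posting this as an answer, but I can't comment yet. I just wanted to give him/her some advice before he/she runs into the next problem. It is essential to understand which geometries the solution works for on the GPU, and user3390212 may not have realized this from the bug in his code.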
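
The grid-sizing question raised in this thread can be settled with a few lines of integer arithmetic. The sketch below is an illustration under stated assumptions: `blocks_ceil` and `blocks_solution2` are hypothetical names, the first using the common ceiling-division idiom and the second reproducing the grid computation written in Solution 2, so the two can be compared for a given array size and block size.

```cpp
#include <cassert>

// Blocks per dimension so that blocks * block >= size (ceiling division).
int blocks_ceil(int size, int block) {
    assert(size > 0 && block > 0);
    return (size + block - 1) / block;
}

// Grid computation as written in Solution 2, reproduced for comparison.
int blocks_solution2(int size, int block) {
    assert(size > 0 && block > 0);
    int grid = size / block;
    if (size % block) grid = 1;  // partial sizes collapse to a single block
    return grid;
}
```

With ceiling division, a 4x4 array and 16x16 blocks need one block per dimension (`blocks_ceil(4, 16)` is 1), and the surplus threads are discarded by the bounds check from Solution 1. By contrast, `blocks_solution2(10, 4)` is 1, so a single 4x4 block would cover only 4 of the 10 rows and columns, whereas `blocks_ceil(10, 4)` is 3.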