CUDA program fails to execute correctly even though the deviceQuery test passed

【Question title】: CUDA program fails to execute correctly despite that deviceQuery test passed
【Posted】: 2021-09-22 02:04:40
【Question】:

I just installed the NVIDIA CUDA toolkit on a fresh Ubuntu 20.04 installation. nvcc compiles CUDA programs, and they run without crashing. However, none of the results are correct.

Here is the output of the test program provided by NVIDIA (deviceQuery):

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GTX 770"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1997 MBytes (2093875200 bytes)
  (008) Multiprocessors, (192) CUDA Cores/MP:    1536 CUDA Cores
  GPU Max Clock rate:                            1110 MHz (1.11 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1 
Result = PASS

Here is the very simple vector addition program I am trying to run:

#include <cuda_runtime.h>

#include <iostream>
#include <cuda.h>
using namespace std;

int *a, *b;  // host data
int *c;  // results

__global__ void vecAdd(int *A,int *B,int *C)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   C[i] = A[i] + B[i];
}

int main(int argc,char **argv)
{
   printf("Begin \n");
   int n=1000000;
   int nBytes = n*sizeof(int);
   int block_size, block_no;
   a = (int *)malloc(nBytes);
   b = (int *)malloc(nBytes);
   c = (int *)malloc(nBytes);
   int *a_d,*b_d,*c_d;
   block_size=1000;
   block_no = n/block_size;
   for(int i=0;i<n;i++) {
      a[i]=i;
      b[i]=i;
   }

   printf("Allocating device memory on host..\n");
   cudaMalloc((void **)&a_d,n*sizeof(int));
   cudaMalloc((void **)&b_d,n*sizeof(int));

   printf("Copying to device..\n");
   cudaMemcpy(a_d,a,n*sizeof(int),cudaMemcpyHostToDevice);
   cudaMemcpy(b_d,b,n*sizeof(int),cudaMemcpyHostToDevice);

   printf("Doing GPU Vector add\n");
   vecAdd<<<block_no,block_size>>>(a_d,b_d,c_d);

   cudaMemcpy(c,c_d,n*sizeof(int),cudaMemcpyDeviceToHost);
   for(int i = 0; i < 10; i++) {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
   }
   cudaFree(a_d);
   cudaFree(b_d);
   cudaFree(c_d);
   free(a);
   free(b);
   free(c);
   return 0;
}

Last but not least, here is the incorrect output:


Begin
Allocating device memory on host..
Copying to device..
Doing GPU Vector add
0 + 0 = 0
1 + 1 = 0
2 + 2 = 0
3 + 3 = 0
4 + 4 = 0
5 + 5 = 0
6 + 6 = 0
7 + 7 = 0
8 + 8 = 0
9 + 9 = 0

Any help is greatly appreciated.

【Comments】:

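You have several problems. No version of CUDA 11.x supports compute capability 3.0 devices. Switch to CUDA 10.x. Also pay attention to the answer given: your code is indeed broken. In addition, whenever you are having trouble with CUDA code, use proper CUDA error checking.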
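That is helpful, thank you. I now see that the code is broken (assuming you mean the missing cudaMalloc for the c_d variable). As I said in my reply to that answer, the problem persists even with that line added. Hopefully installing the correct version of CUDA will fix it.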
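
A minimal sketch of how the mismatch described in the comments above could be surfaced at runtime. It prints the runtime version, the driver version, and the device's compute capability, then warns for the one combination named in the comment (a 3.0 device with an 11.x runtime). The warning condition is an assumption taken from that comment, not a complete support table.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int runtimeVersion = 0;
    int driverVersion = 0;
    // Versions are encoded as 1000 * major + 10 * minor (e.g. 11040 for 11.4).
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Runtime %d.%d, driver %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10,
           driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Device 0: \"%s\", compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    // Assumption from the comment above: CUDA 11.x cannot target compute capability 3.0.
    if (runtimeVersion >= 11000 && prop.major == 3 && prop.minor == 0) {
        fprintf(stderr, "Warning: this runtime likely cannot run kernels on this device.\n");
    }
    return 0;
}
```

Note that deviceQuery passing only shows that the device can be enumerated; it does not launch a kernel, which is why the test in the question could pass while the program still produced wrong results.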

Tags: c++ parallel-processing cuda gpu

【Solution 1】:

You never allocate storage for c on the device. Try adding

cudaMalloc((void **)&c_d,n*sizeof(int));

before the call to the CUDA kernel.
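
For context, here is a sketch of the program from the question with that allocation in place. The extra `n` kernel parameter, the bounds guard, the block size of 256, and the rounded-up block count are additions for illustration and are not part of the original code. Only the final copy is checked here, to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// n and the bounds guard are illustrative additions: they keep threads in a
// partially filled last block from writing past the end of the arrays.
__global__ void vecAdd(const int *A, const int *B, int *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

int main()
{
    const int n = 1000000;
    const size_t nBytes = n * sizeof(int);

    int *a = (int *)malloc(nBytes);
    int *b = (int *)malloc(nBytes);
    int *c = (int *)malloc(nBytes);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }

    int *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    cudaMalloc((void **)&a_d, nBytes);
    cudaMalloc((void **)&b_d, nBytes);
    cudaMalloc((void **)&c_d, nBytes);  // the allocation missing from the question

    cudaMemcpy(a_d, a, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, nBytes, cudaMemcpyHostToDevice);

    const int block_size = 256;
    const int block_no = (n + block_size - 1) / block_size;  // round up
    vecAdd<<<block_no, block_size>>>(a_d, b_d, c_d, n);

    // This blocking copy also reports errors from the preceding kernel.
    cudaError_t err = cudaMemcpy(c, c_d, nBytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < 10; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    free(a);
    free(b);
    free(c);
    return 0;
}
```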

【Comments】:

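I actually deleted exactly that line just before posting my question, because I was convinced it was not the cause. The same problem persists.
When I add that line, on another known-good, working CUDA setup, I get the expected results.
Robert, that is good to know, thank you. As for your point that users can only help based on the information provided, I completely agree. That was my fault. Seriously, thank you very much for your comments, your answer, and for taking the time to run my code.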
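If you want to avoid debugging completely in the dark, I strongly recommend adding error handling to your code. When I run the code on a machine without a GPU driver, I get the all-zero results you report. After adding error-reporting code, I see GPUassert: CUDA driver version is insufficient for CUDA runtime version main.cu 45, which is much more helpful.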
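
The kind of error-reporting code that comment refers to might look like the sketch below. The names `gpuErrchk`, `gpuAssert`, and the `fill` kernel are illustrative assumptions and not part of the CUDA API. The message format is chosen to match the GPUassert line quoted above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: prints the error string, file, and line, then exits.
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(EXIT_FAILURE);
    }
}

// Hypothetical wrapper macro that records the call site of each checked call.
#define gpuErrchk(ans) do { gpuAssert((ans), __FILE__, __LINE__); } while (0)

// Small illustrative kernel: writes each element's own index.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i;
    }
}

int main()
{
    const int n = 1024;
    int *out_d = nullptr;

    gpuErrchk(cudaMalloc((void **)&out_d, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(out_d, n);
    gpuErrchk(cudaGetLastError());       // errors from the launch itself
    gpuErrchk(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int last = -1;
    gpuErrchk(cudaMemcpy(&last, out_d + (n - 1), sizeof(int), cudaMemcpyDeviceToHost));
    printf("out[%d] = %d\n", n - 1, last);

    gpuErrchk(cudaFree(out_d));
    return 0;
}
```

Wrapping every runtime call this way means the first failing call stops the program with a message, instead of letting later calls silently operate on invalid state and print zeros.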