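[Posted]: 2021-09-22 02:04:40
[Question]:
I just installed the NVIDIA CUDA toolkit on my brand-new Ubuntu 20.04 installation. nvcc compiles CUDA programs, and they run without crashing. However, none of the results are correct.
Here is the output of the sample test program (deviceQuery) provided by NVIDIA: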
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 770"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1997 MBytes (2093875200 bytes)
(008) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1110 MHz (1.11 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
Here is the very simple vector addition program I am trying to run:
#include <cuda_runtime.h>
#include <iostream>
#include <cuda.h>
using namespace std;
int *a, *b; // host data
int *c; // results
__global__ void vecAdd(int *A,int *B,int *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
int main(int argc,char **argv)
{
    printf("Begin \n");
    int n=1000000;
    int nBytes = n*sizeof(int);
    int block_size, block_no;
    a = (int *)malloc(nBytes);
    b = (int *)malloc(nBytes);
    c = (int *)malloc(nBytes);
    int *a_d,*b_d,*c_d;
    block_size=1000;
    block_no = n/block_size;
    for(int i=0;i<n;i++) {
        a[i]=i;
        b[i]=i;
    }
    printf("Allocating device memory on host..\n");
    cudaMalloc((void **)&a_d,n*sizeof(int));
    cudaMalloc((void **)&b_d,n*sizeof(int));
    printf("Copying to device..\n");
    cudaMemcpy(a_d,a,n*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(b_d,b,n*sizeof(int),cudaMemcpyHostToDevice);
    printf("Doing GPU Vector add\n");
    vecAdd<<<block_no,block_size>>>(a_d,b_d,c_d);
    cudaMemcpy(c,c_d,n*sizeof(int),cudaMemcpyDeviceToHost);
    for(int i = 0; i < 10; i++) {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
    }
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    free(a);
    free(b);
    free(c);
    return 0;
}
Last but not least, here is the incorrect output:
Begin
Allocating device memory on host..
Copying to device..
Doing GPU Vector add
0 + 0 = 0
1 + 1 = 0
2 + 2 = 0
3 + 3 = 0
4 + 4 = 0
5 + 5 = 0
6 + 6 = 0
7 + 7 = 0
8 + 8 = 0
9 + 9 = 0
Any help is greatly appreciated.
[Discussion]:
-
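You have several problems. No CUDA 11.x release supports compute capability 3.0 devices. Switch to CUDA 10.x. Also take note of the answer that was given: your code really is broken. In addition, whenever you are having trouble with CUDA code, use proper CUDA error checking.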
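One common way to do the error checking recommended here is to wrap every runtime call in a macro and to check the kernel launch separately. The sketch below is an illustration, not code from this thread: `CUDA_CHECK` and `noop` are hypothetical names. It also prints the device's compute capability so it can be compared against what the installed toolkit supports. On a device the toolkit cannot target, the launch check would be expected to report an error instead of the program silently printing zeros.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA toolkit): evaluates a
// runtime API call and aborts with file/line and the error string on failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                         __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Trivial kernel used only to exercise the launch path.
__global__ void noop() {}

int main() {
    int device = 0;
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDevice(&device));
    CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);

    noop<<<1, 1>>>();
    // A kernel launch returns nothing, so errors are queried explicitly.
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    std::printf("Kernel launched and completed without error\n");
    return 0;
}
```

Assuming the file is saved as `check.cu` (a hypothetical name), it can be built with `nvcc check.cu -o check`. The same two checks after `<<<...>>>` can be added to any kernel launch in the program above.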
-
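That was helpful, thank you. And I now see that the code is broken (that is, if you mean the missing cudaMalloc for the d_c variable). As I said in my reply to that answer, the problem persists. Hopefully installing the correct version of CUDA will fix this.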
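For reference, a minimal sketch of the vector add with the missing device allocation for `c_d` added. A few things here are assumptions that go beyond the original program: the `check` helper is a hypothetical name, the kernel takes `n` and guards against out-of-range indices, and the block size of 256 is an arbitrary choice. This only addresses the code bug; it does not work around a toolkit that cannot target the installed GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Element-wise add; the bounds check lets the grid be rounded up safely.
__global__ void vecAdd(const int *A, const int *B, int *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int n = 1000000;
    const size_t nBytes = n * sizeof(int);

    // Host data
    std::vector<int> a(n), b(n), c(n);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }

    // Device buffers: all three are allocated, including c_d.
    int *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    check(cudaMalloc((void **)&a_d, nBytes), "cudaMalloc a_d");
    check(cudaMalloc((void **)&b_d, nBytes), "cudaMalloc b_d");
    check(cudaMalloc((void **)&c_d, nBytes), "cudaMalloc c_d");

    check(cudaMemcpy(a_d, a.data(), nBytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(b_d, b.data(), nBytes, cudaMemcpyHostToDevice), "copy b");

    const int block_size = 256;                              // assumed value
    const int block_no = (n + block_size - 1) / block_size;  // round up
    vecAdd<<<block_no, block_size>>>(a_d, b_d, c_d, n);
    check(cudaGetLastError(), "vecAdd launch");

    // cudaMemcpy waits for the kernel to finish before copying.
    check(cudaMemcpy(c.data(), c_d, nBytes, cudaMemcpyDeviceToHost), "copy c");

    for (int i = 0; i < 10; i++) {
        std::printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

Rounding the block count up and guarding with `i < n` keeps the kernel correct when `n` is not a multiple of the block size, which the original `n/block_size` would not handle.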
Tags: c++ parallel-processing cuda gpu