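[Posted]: 2019-01-01 00:34:48
[Question]:
I am trying to sum the values of many vectors using CUDA C++. I have found some solutions for two vectors. As you can see, two vectors can be added, but I want to dynamically generate many vectors of the same length.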
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Is there a way to do this for many vectors? My vector sizes are:
#vector length
N = 1000
#number of vectors
i = 300000
v[i] = [1,2,..., N]
As a result, I need to get:
out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
Thanks for any advice.
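One reading of the out[i] specification above is a per-vector reduction: one scalar sum for each of the 300000 vectors. A minimal sketch of that reading follows. It assumes all vectors are packed back to back in one row-major array, so element j of vector k sits at index k*n + j. The names sumEachVector, sumEachVectorOnGpu and kThreads are hypothetical, and CUDA error checking is left out for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// One block per vector. Each thread accumulates a strided partial sum over
// its vector, then the block combines the partials with a shared-memory
// tree reduction. Assumes row-major storage: vector k starts at v + k*n.
__global__ void sumEachVector(const double *v, double *out, int n)
{
    __shared__ double partial[kThreads];
    const double *row = v + (size_t)blockIdx.x * n;

    double acc = 0.0;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        acc += row[j];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Halve the number of active threads on every step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: copies the packed vectors to the GPU, launches
// one block per vector, and returns the per-vector sums.
// Error checking of the CUDA calls is omitted in this sketch.
std::vector<double> sumEachVectorOnGpu(const std::vector<double> &h_v,
                                       int numVectors, int n)
{
    std::vector<double> h_out(numVectors);
    double *d_v = nullptr;
    double *d_out = nullptr;
    size_t inBytes = (size_t)numVectors * n * sizeof(double);
    size_t outBytes = (size_t)numVectors * sizeof(double);

    cudaMalloc(&d_v, inBytes);
    cudaMalloc(&d_out, outBytes);
    cudaMemcpy(d_v, h_v.data(), inBytes, cudaMemcpyHostToDevice);

    sumEachVector<<<numVectors, kThreads>>>(d_v, d_out, n);

    cudaMemcpy(h_out.data(), d_out, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_v);
    cudaFree(d_out);
    return h_out;
}
```

With N = 1000 and 300000 vectors of double, the packed input is about 2.4 GB, so it has to fit in device memory or be processed in batches.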
[Comments]:
-
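There is some confusion between the CUDA C++ example you gave and what you are trying to achieve. Do you want to reduce each vector and store the individual sums (reductions) of all the vectors in an array? Or do you want to perform an element-wise addition across all the vectors?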
-
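What does sum(v) mean? Is it a reduction operation? The code you show is vector addition, which is not the same thing at all.
From the example above it is clear that we can only add two vectors. I want to compute the sum of N vectors. I mean that we can dynamically generate vectors of the same length (V = 300000 #number of vectors). So I declare the number of vectors, and the program should give me the sum of all the vectors.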
-
sum(v) means the sum of all the vectors that we declare as parameters.
-
I would appreciate it if someone could help me reach the proper result.
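The other reading raised in the comments is an element-wise sum across all the vectors, which gives a single result vector of length N. A minimal sketch of that reading follows, again assuming the vectors are packed row-major in one array (element j of vector k at index k*n + j). The names sumAcrossVectors and sumAcrossVectorsOnGpu are hypothetical, and error checking is omitted.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element j: the thread walks over all vectors and
// accumulates element j of each. Neighbouring threads read neighbouring
// addresses, so the global-memory loads are coalesced.
__global__ void sumAcrossVectors(const double *v, double *out,
                                 int numVectors, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n)
        return;

    double acc = 0.0;
    for (int k = 0; k < numVectors; ++k)
        acc += v[(size_t)k * n + j];
    out[j] = acc;
}

// Hypothetical host wrapper: returns a vector of length n whose element j
// is the sum of element j over all input vectors.
// Error checking of the CUDA calls is omitted in this sketch.
std::vector<double> sumAcrossVectorsOnGpu(const std::vector<double> &h_v,
                                          int numVectors, int n)
{
    std::vector<double> h_out(n);
    double *d_v = nullptr;
    double *d_out = nullptr;
    size_t inBytes = (size_t)numVectors * n * sizeof(double);
    size_t outBytes = (size_t)n * sizeof(double);

    cudaMalloc(&d_v, inBytes);
    cudaMalloc(&d_out, outBytes);
    cudaMemcpy(d_v, h_v.data(), inBytes, cudaMemcpyHostToDevice);

    const int blockSize = 256;  // assumed block size
    const int gridSize = (n + blockSize - 1) / blockSize;
    sumAcrossVectors<<<gridSize, blockSize>>>(d_v, d_out, numVectors, n);

    cudaMemcpy(h_out.data(), d_out, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_v);
    cudaFree(d_out);
    return h_out;
}
```

With N = 1000 this launches only about a thousand threads, so it favours simplicity over throughput. Splitting the loop over vectors across several blocks and combining the partial results would expose more parallelism.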
Tags: cuda