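【Posted】: 2013-04-19 03:05:28
【Question】:
How can I optimize the following CUDA kernel further? Or is it already optimized for its purpose?
I was thinking that I could perhaps use __constant__ memory, set from the host code, to fill the arrays with random numbers. Is that possible? I know it is read-only memory, so I am confused about whether I can use constant memory instead of global memory.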
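On the __constant__ question: constant memory is read-only only from device code. The host can still fill it through cudaMemcpyToSymbol, so a kernel may read such an array but cannot write random numbers into it. The following is a minimal sketch under assumptions, not the asker's code: the names c_posx and copyFromConstant and the width of 640 are hypothetical, and the random values come from the host-side rand() rather than cuRAND.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 100  // array size taken from the question

// Hypothetical constant-memory array: writable from the host,
// read-only inside kernels.
__constant__ float c_posx[N];

// Hypothetical kernel that only reads the constant array.
__global__ void copyFromConstant(float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = c_posx[idx];
    }
}

int main() {
    // Generate the random values on the host (assumed width of 640).
    float h_posx[N];
    for (int i = 0; i < N; ++i) {
        h_posx[i] = (float)rand() / (float)RAND_MAX * 640.0f;
    }

    // The host writes constant memory through the runtime API.
    cudaError_t err = cudaMemcpyToSymbol(c_posx, h_posx, sizeof(h_posx));
    if (err != cudaSuccess) {
        printf("cudaMemcpyToSymbol failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* d_out = NULL;
    cudaMalloc(&d_out, N * sizeof(float));
    copyFromConstant<<<1, 128>>>(d_out, N);

    // Copy the result back and compare it with the host values.
    float h_out[N];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int mismatches = 0;
    for (int i = 0; i < N; ++i) {
        if (h_out[i] != h_posx[i]) {
            ++mismatches;
        }
    }
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

Two caveats apply. Constant memory is limited to 64 KB in total, and it is fastest when all threads of a warp read the same address; per-thread indexing as above gives up that broadcast advantage, so plain global memory may be just as good for arrays like these.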
/*
 * CUDA kernel that will execute 100 threads in parallel
 * and will populate these parallel arrays with 100 random numbers
 * array size = 100.
 */
__global__ void initializeArrays(float* posx, float* posy, float* rayon, float* veloc,
                                 float* opacity, float* angle, unsigned char* color, int height,
                                 int width, curandState* state, size_t pitch) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    curandState localState = state[idx];
    posx[idx] = (float)(curand_normal(&localState) * width);
    posy[idx] = (float)(curand_normal(&localState) * height);
    rayon[idx] = (float)(10 + curand_normal(&localState) * 50);
    angle[idx] = (float)(curand_normal(&localState) * 360);
    veloc[idx] = (float)(curand_uniform(&localState) * 20 - 10);
    color[idx * pitch] = (unsigned char)(curand_normal(&localState) * 255);
    color[(idx * pitch) + 1] = (unsigned char)(curand_normal(&localState) * 255);
    color[(idx * pitch) + 2] = (unsigned char)(curand_normal(&localState) * 255);
    opacity[idx] = (float)(0.3f + 1.5f * curand_normal(&localState));
    __syncthreads();
}
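【Comments】:
-
Why do you need __syncthreads?
-
Which architecture do you need to optimize for? Fermi? Kepler?
-
I need to optimize for a graphics card with compute capability 1.2. I think that is Fermi?
-
Why not use a number of threads that is a multiple of the warp size (i.e. 32)? Also, have you used the NVIDIA profiling tools (e.g. nvvp)? Are you compiling with optimization flags?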
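The warp-size suggestion in the last comment can be sketched as follows: round the block size up to a multiple of 32 and guard the kernel with a bounds check so the extra threads do nothing. This is only an illustration under assumptions. The kernel names setupStates and fillVeloc, the seed 1234, and the block size of 128 are hypothetical, and only the veloc array from the question is filled.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical setup kernel: one cuRAND state per element.
__global__ void setupStates(curandState* state, unsigned long long seed, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        curand_init(seed, idx, 0, &state[idx]);
    }
}

// Hypothetical kernel filling one array; threads past n exit early.
__global__ void fillVeloc(float* veloc, curandState* state, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) {
        return;
    }
    curandState localState = state[idx];
    veloc[idx] = curand_uniform(&localState) * 20.0f - 10.0f;
    state[idx] = localState;  // save the advanced state for later launches
}

int main() {
    const int n = 100;                // array size from the question
    const int threadsPerBlock = 128;  // a multiple of the warp size (32)
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    curandState* d_state = NULL;
    float* d_veloc = NULL;
    cudaMalloc(&d_state, n * sizeof(curandState));
    cudaMalloc(&d_veloc, n * sizeof(float));

    setupStates<<<blocks, threadsPerBlock>>>(d_state, 1234ULL, n);
    fillVeloc<<<blocks, threadsPerBlock>>>(d_veloc, d_state, n);

    float h_veloc[n];
    cudaError_t err = cudaMemcpy(h_veloc, d_veloc, sizeof(h_veloc),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_veloc);
    cudaFree(d_state);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // curand_uniform returns values in (0, 1], so veloc lies in (-10, 10].
    int outOfRange = 0;
    for (int i = 0; i < n; ++i) {
        if (h_veloc[i] < -10.0f || h_veloc[i] > 10.0f) {
            ++outOfRange;
        }
    }
    printf("out of range: %d\n", outOfRange);
    return outOfRange != 0;
}
```

With only 100 elements the launch overhead likely dominates the runtime, so a profiler run is the reliable way to tell whether any of this matters in practice.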
Tags: c++ optimization memory-management cuda nvidia