上面的答案指出了块大小如何影响性能,并提出了一个基于占用最大化的通用启发式选择。在不想提供 标准来选择块大小的情况下,值得一提的是 CUDA 6.5(现在处于候选发布版本)包括几个新的运行时函数来帮助计算占用率和启动配置,请参阅
CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
其中一个有用的函数是cudaOccupancyMaxPotentialBlockSize,它启发式地计算达到最大占用率的块大小。然后可以将该函数提供的值用作手动优化启动参数的起点。下面是一个小例子。
#include <stdio.h>
/************************/
/* TEST KERNEL FUNCTION */
/************************/
__global__ void MyKernel(int *a, int *b, int *c, int N)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N) { c[idx] = a[idx] + b[idx]; }
}
/********/
/* MAIN */
/********/
void main()
{
const int N = 1000000;
int blockSize; // The launch configurator returned block size
int minGridSize; // The minimum grid size needed to achieve the maximum occupancy for a full device launch
int gridSize; // The actual grid size needed, based on input size
int* h_vec1 = (int*) malloc(N*sizeof(int));
int* h_vec2 = (int*) malloc(N*sizeof(int));
int* h_vec3 = (int*) malloc(N*sizeof(int));
int* h_vec4 = (int*) malloc(N*sizeof(int));
int* d_vec1; cudaMalloc((void**)&d_vec1, N*sizeof(int));
int* d_vec2; cudaMalloc((void**)&d_vec2, N*sizeof(int));
int* d_vec3; cudaMalloc((void**)&d_vec3, N*sizeof(int));
for (int i=0; i<N; i++) {
h_vec1[i] = 10;
h_vec2[i] = 20;
h_vec4[i] = h_vec1[i] + h_vec2[i];
}
cudaMemcpy(d_vec1, h_vec1, N*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_vec2, h_vec2, N*sizeof(int), cudaMemcpyHostToDevice);
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N);
// Round up according to array size
gridSize = (N + blockSize - 1) / blockSize;
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Occupancy calculator elapsed time: %3.3f ms \n", time);
cudaEventRecord(start, 0);
MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Kernel elapsed time: %3.3f ms \n", time);
printf("Blocksize %i\n", blockSize);
cudaMemcpy(h_vec3, d_vec3, N*sizeof(int), cudaMemcpyDeviceToHost);
for (int i=0; i<N; i++) {
if (h_vec3[i] != h_vec4[i]) { printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]); return; };
}
printf("Test passed\n");
}
编辑
cudaOccupancyMaxPotentialBlockSize在cuda_runtime.h文件中定义,定义如下:
template<class T>
__inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize(
int *minGridSize,
int *blockSize,
T func,
size_t dynamicSMemSize = 0,
int blockSizeLimit = 0)
{
return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func, __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit);
}
参数含义如下
minGridSize = Suggested min grid size to achieve a full machine launch.
blockSize = Suggested block size to achieve maximum occupancy.
func = Kernel function.
dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed as it is inferred by the properties of func.
blockSizeLimit = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.
请注意,从 CUDA 6.5 开始,需要根据 API 建议的 1D 块大小计算自己的 2D/3D 块尺寸。
另请注意,CUDA 驱动程序 API 包含用于计算占用率的功能等效 API,因此可以在驱动程序 API 代码中使用 cuOccupancyMaxPotentialBlockSize,其方式与上例中运行时 API 显示的方式相同。