cuda 线程和块答案

【问题标题】：cuda threads and blockscuda 线程和块
【发布时间】：2012-05-13 14:03:12
【问题描述】：

我在 NVIDIA 论坛上发布了这个，我想我会得到更多的帮助。

我在尝试扩展我的代码以执行多种情况时遇到了麻烦。我一直在考虑最常见的情况进行开发，现在是测试的时候了，我需要确保它适用于不同的情况。目前我的内核是在一个循环中执行的（有一些原因我们没有做一个内核调用来完成整个事情。）以计算矩阵行中的值。最常见的情况是 512 列乘 512 行。我需要考虑大小为 512 x 512、1024 x 512、512 x 1024 和其他组合的矩阵，但最大的矩阵将是 1024 x 1024 矩阵。我一直在使用一个相当简单的内核调用：

launchKernel<<<1,512>>>(................)

此内核适用于常见的 512x512 和 512 x 1024（分别为列、行）情况，但不适用于 1024 x 512 情况。这种情况需要 1024 个线程来执行。在我的天真中，我一直在尝试不同版本的简单内核调用来启动 1024 个线程。

launchKernel<<<2,512>>>(................)  // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???

我相信我的问题与我对线程和块缺乏了解有关

这是 deviceQuery 的输出，你可以看到我最多可以有 1024 个线程

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2688 MBytes (2818572288 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock Speed:                               1.15 GHz
  Memory Clock rate:                             1500.00 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                Yes
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           40 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro 600"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors x (48) CUDA Cores/MP:     96 CUDA Cores
  GPU Clock Speed:                               1.28 GHz
  Memory Clock rate:                             800.00 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600

我只使用 Tesla C2050 设备这是我的内核的剥离版本，因此您可以了解它在做什么。

#define twoPi               6.283185307179586
#define speed_of_light      3.0E8
#define MaxSize             999

__global__ void calcRx4CPP4
(  
        const float *array1,  
        const double *array2,  
        const float scalar1,  
        const float scalar2,  
        const float scalar3,  
        const float scalar4,  
        const float scalar5,  
        const float scalar6,  
        const int scalar7,  
        const int scalar8,    
        float *outputArray1,
        float *outputArray2)  
{  

    float scalar9;  
    int idx;  
    double scalar10;
    double scalar11;  
    float sumReal, sumImag;  
    float real, imag;  

    float coeff1, coeff2, coeff3, coeff4;  

    sumReal = 0.0;  
    sumImag = 0.0;  

    // kk loop 1 .. 512 (scalar7)  
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;  

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx];  // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx+blockDim.x < MaxSize){
        SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
        SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
    }            
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {  
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }  


    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;
    coeff1 = (sumReal * real);  
    coeff2 = (sumReal * imag);  
    coeff3 = (sumImag * real);  
    coeff4 = (sumImag * imag);  

    outputArray1[idx] = (coeff1 - coeff4);  
    outputArray2[idx] = (coeff2 + coeff3);  


}

因为我每块的最大线程数是 1024，我以为我可以继续使用简单的内核启动，我错了吗？

如何成功启动每个具有 1024 个线程的内核？

【问题讨论】：

实际问题是什么？什么不工作？如果你有 1 个块和 1024 个线程，你需要一个大小为 1024 的共享数组，而不是 MaxSize = 999 的索引。
MaxSize = 999，与线程无关，它只是复制到共享内存的数组的大小。每个线程都必须遍历整个数组以获得总和（实际算法的简化）。问题是当我尝试在 512 列和 1024 行的情况下使用 1024 个线程时它不会工作。结果 outputArray1/2 没有被完全填满，所以所有 1024 个线程都没有被执行。
我不知道你的内核做了什么，但如果你使用 1024 个线程并使用“SharedArray1[idx] = array1[idx];”写入 SM你会明确地写出范围，因为数组的大小是 999 并且 idx 可以是范围 [0-1023]

标签： cuda

【解决方案1】：

您不想改变每个块的线程数。您应该使用 CUDA 占用计算器为您的内核获得每个块的最佳线程数。获得该数字后，您只需启动获得所需线程总数所需的块数。如果给定情况所需的线程数并不总是每个块的线程数的倍数，则在内核顶部添加代码以中止不需要的线程。 (if () return;)。然后，您可以使用额外的参数将矩阵的维度传递给内核，或者使用 x 和 y 网格维度，具体取决于内核中需要哪些信息（我还没有研究过）。

我的猜测是，您在使用 1024 个线程时遇到问题的原因是，即使您的 GPU 在一个块中支持这么多线程，但每个块中可以拥有的线程数量还有另一个限制因素，具体取决于内核中的资源使用情况。限制因素可以是共享内存或寄存器使用。占用计算器会告诉您哪些信息，但该信息仅在您想优化内核时才重要。

【讨论】：

对于那些感兴趣的人，CUDA 占用计算器在这里：developer.download.nvidia.com/compute/cuda/…
“即使你的 GPU 在一个块中支持这么多线程，根据内核中的资源使用情况，每个块中可以拥有的线程数量还有另一个限制因素。”那么 LaunchKernel>>(..) 不应该工作吗，因为这是每个块 512 个线程和 2 个块？
"该内核在常见的 512x512 和 512 x 1024（分别为列、行）情况下工作正常，但不适用于 1024 x 512 情况。这种情况需要 1024 个线程才能执行。那么对于 (x, y) 大小的矩阵，您每次使用 x 个线程调用内核 y 次？
您需要修复 @djmj 告诉您的 MaxSize 问题。看起来最好的方法是动态配置 __shared__ 数组的大小以匹配您正在启动的线程数。为此，您使用第三个内核参数。有关如何将内核中多个共享数组的大小基于单个动态大小的__shared__ 数组的示例，请参见 CUDA C 编程指南 4.2 中的 B.2.3。
我没有看到对共享数组的越界访问。谢谢，这导致内核启动返回未知错误。我现在可以在一个块中使用 1024 个线程。

【解决方案2】：

如果你使用一个有 1024 个线程的块，你会遇到问题，因为 MaxSize 只有 999，导致数据错误。

让我们为最后一个线程 #1023 模拟它

__shared__ float SharedArray1[999];     
__shared__ double SharedArray2[999];

/* populate the arrays on shared memory */     
SharedArray1[1023] = array1[1023]; 
SharedArray2[1023] = array2[1023];     

if (2047 < MaxSize)
{         
    SharedArray1[2047] = array1[2047];         
    SharedArray2[2047] = array2[2047];     
}                 
__syncthreads();

如果您现在在计算中使用所有这些元素，这应该不起作用。（您的计算代码未显示，因此这是一个假设）

【讨论】：