CUDA - 优化表面检测内核答案

【问题标题】：CUDA - optimize surface detection kernelCUDA - 优化表面检测内核
【发布时间】：2015-02-15 07:09:18
【问题描述】：

我正在尝试优化我的表面检测内核；给定输入二进制512w x 1024h 图像，我想找到图像中的第一个亮面。我编写的代码声明了 512 个线程，并在 3x3 邻域中搜索第一个亮像素。代码运行良好，但在~9.46 ms 有点慢，我想让它运行得更快。

编辑 1： 性能提高了不到我的原始内核运行时间的一半。 Robert 的内核在我的 Quadro K6000 上以 4.032 ms 运行。

编辑 2： 设法通过将线程数减半来进一步提高性能。现在，我的（Robert 修改后的）内核在我的 Quadro K6000 上以2.125 ms 运行。

内核被调用：

firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);

我想使用共享内存来改进内存获取；关于如何优化这段代码的任何想法？

__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

int col = threadIdx.x + (blockDim.x*blockIdx.x); 
int rows2skip = 10; 
float thresh = 1.0f;

 //thread Index: (0 -> 511)

if (col < width) {

    if( col == 0 ) { // first col - 0
        for (int row = 0 + rows2skip; row < height - 2; row++) { // skip first 30 rows
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the right 
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col+1)];           if(neibs[1] == thresh) { cnt++; }   // right
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col+1)];         if(neibs[3] == thresh) { cnt++; }   // bottom right
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom right

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }
    }

    else if ( col == (width-1) ) { // last col 
        for (int row = 0 + rows2skip; row < height -2; row++) { 
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the left
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col-1)];           if(neibs[1] == thresh) { cnt++; }   // left
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }   // bottom left
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col-1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom left

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }       
    }

    // remaining threads are: (1 -> 510) 

    else { // any col other than first or last column
        for (int row = 0 + rows2skip; row < height - 2; row++) { 

            int cnt = 0;
            float neibs[9]; // not shared mem as it reduces speed   

            // for threads < width/4, get the neighbors
            // get nine neighbours - three in curr col, three each to left and right
            neibs[0] = threshImg[((row)*width) +(col-1)];           if(neibs[0] == thresh) { cnt++; } 
            neibs[1] = threshImg[((row)*width) +(col)];             if(neibs[1] == thresh) { cnt++; } 
            neibs[2] = threshImg[((row)*width) +(col+1)];           if(neibs[2] == thresh) { cnt++; }           
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }           
            neibs[4] = threshImg[((row+1)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }           
            neibs[5] = threshImg[((row+1)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }           
            neibs[6] = threshImg[((row+2)*width) +(col-1)];         if(neibs[6] == thresh) { cnt++; }           
            neibs[7] = threshImg[((row+2)*width) +(col)];           if(neibs[7] == thresh) { cnt++; }           
            neibs[8] = threshImg[((row+2)*width) +(col+1)];         if(neibs[8] == thresh) { cnt++; }

            if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary

                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
                }
            }
        }       
    }           

__syncthreads();
}

【问题讨论】：

我无法回答这个问题，但请确保您使用的是the latest CUDA drivers
512 个线程不足以让 GPU 保持忙碌。如果你对性能感兴趣，你永远不想启动像这样的内核：<<<1,...>>> 或这样的：<<<...,1>>> 你已经暴露了图像宽度上的并行性，现在是时候暴露整个图像的并行性了图像的高度。摆脱你的 for 循环并将网格增加到足够多的线程（可能想要转到 2D 网格）以让每个线程处理一个像素，而不是让每个线程处理一列。
一旦你的线程数增加了，你就可以通过将图像数据块加载到共享内存中来以非常简单的方式使用共享内存，并让每个线程在共享内存区域之外工作负载和测试。我不知道为什么人们将__syncthreads() 放在内核的末尾。它在那里毫无用处。
@RobertCrovella 我认为您一定是指将 2D 方形数据块加载到共享内存中，就像 CUDA 编程指南中的矩阵乘法示例一样？
顺便说一句，这个结构：threshImg[((row+2)*width) 在我看来确实有可能索引出threshImg 的边界。也许你的 for 循环应该停在 row < height -2

标签： c++ cuda

【解决方案1】：

这是一个有效的示例，演示了 cmets 中讨论的 3 个概念中的 2 个：

要考虑的第一个优化是 512 个线程不足以让任何 GPU 保持忙碌。我们希望以 10000 个或更多线程为目标。 GPU 是一个隐藏延迟的机器，当你的线程太少而无法帮助 GPU 隐藏延迟时，你的内核就会成为延迟限制，这是一种内存限制问题。最直接的方法是让每个线程处理图像中的一个像素（总共允许 512*1024 个线程），而不是一列（总共只允许 512 个线程）。但是，由于这似乎“破坏”了我们的“第一表面检测”算法，我们还必须进行另一次修改 (2)。
一旦我们并行处理了所有像素，那么直接调整上面的第 1 项意味着我们不再知道哪个表面是“第一个”，即哪个“亮”表面（每列）最接近行0. 算法的这一特性将问题从简单的变换变为归约（实际上每列图像归约一次。）我们将允许每一列并行处理，通过为每个像素分配 1 个线程，但我们将选择满足亮度测试的结果像素，即 最接近 行零。一种相对简单的方法是在发现合适明亮像素邻域的最小行（每列）的每列一个数组上使用atomicMin。

以下代码演示了上述 2 项更改（不包括任何共享内存的使用）并演示了（对我而言，在 Tesla K40 上）与 OP 的原始内核相比，加速范围提高了 1 到 20 倍。加速的范围是由于算法的工作方式因图像结构而异。两种算法都有提前退出策略。由于 for 循环上的提前退出结构，原始内核可以做更多或更少的工作，这取决于在每列中发现“明亮”像素邻域的位置（如果有的话）。因此，如果所有列在第 0 行附近都有明亮的邻域，我会看到大约 1 倍的“改进”（即我的内核运行速度与原始速度大致相同）。如果所有列（仅）在图像的另一“端”附近都有明亮的邻域，我会看到大约 20 倍的改进。这可能因 GPU 而异，因为开普勒 GPU 提高了我正在使用的全局原子吞吐量。 编辑：由于这种可变工作，我添加了一个粗略的“提前退出”策略作为对我的代码的微不足道的修改。这带来了最短的执行时间来近似两个内核之间的奇偶校验（即大约 1 倍）。

剩余的优化可能包括：

使用共享内存。这应该是对用于例如矩阵乘法的相同基于图块的共享内存方法的简单改编。如果您使用方形瓷砖，那么您将需要调整内核块/网格尺寸以使这些“方形”。
改进的还原技术。对于某些图像结构，原子方法可能会有些慢。这可以通过切换到每列适当的并行减少来改善。您可以通过将测试图像设置为任何地方的阈值来对我的内核进行“最坏情况”测试。这应该会导致原始内核运行得最快，而我的内核运行得最慢，但在这种情况下我没有观察到内核的任何显着减速。我的内核的执行时间相当稳定。同样，这可能取决于 GPU。

例子：

#include <stdlib.h>
#include <stdio.h>

#define SKIP_ROWS 10
#define THRESH 1.0f

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

int col = threadIdx.x + (blockDim.x*blockIdx.x); 
int rows2skip = SKIP_ROWS; 
float thresh = THRESH;

 //thread Index: (0 -> 511)

if (col < width) {

    if( col == 0 ) { // first col - 0
        for (int row = 0 + rows2skip; row < height; row++) { // skip first 30 rows
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the right 
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col+1)];           if(neibs[1] == thresh) { cnt++; }   // right
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col+1)];         if(neibs[3] == thresh) { cnt++; }   // bottom right
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom right

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }
    }

    else if ( col == (width-1) ) { // last col 
        for (int row = 0 + rows2skip; row < height; row++) { 
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the left
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col-1)];           if(neibs[1] == thresh) { cnt++; }   // left
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }   // bottom left
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col-1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom left

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }       
    }

    // remaining threads are: (1 -> 510) 

    else { // any col other than first or last column
        for (int row = 0 + rows2skip; row < height; row++) { 

            int cnt = 0;
            float neibs[9]; // not shared mem as it reduces speed   

            // for threads < width/4, get the neighbors
            // get nine neighbours - three in curr col, three each to left and right
            neibs[0] = threshImg[((row)*width) +(col-1)];           if(neibs[0] == thresh) { cnt++; } 
            neibs[1] = threshImg[((row)*width) +(col)];             if(neibs[1] == thresh) { cnt++; } 
            neibs[2] = threshImg[((row)*width) +(col+1)];           if(neibs[2] == thresh) { cnt++; }           
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }           
            neibs[4] = threshImg[((row+1)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }           
            neibs[5] = threshImg[((row+1)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }           
            neibs[6] = threshImg[((row+2)*width) +(col-1)];         if(neibs[6] == thresh) { cnt++; }           
            neibs[7] = threshImg[((row+2)*width) +(col)];           if(neibs[7] == thresh) { cnt++; }           
            neibs[8] = threshImg[((row+2)*width) +(col+1)];         if(neibs[8] == thresh) { cnt++; }

            if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary

                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
                }
            }
        }       
    }           

__syncthreads();
}

__global__ void firstSurfaceDetection_opt (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x); 
  int row = threadIdx.y + (blockDim.y*blockIdx.y);

  int rows2skip = SKIP_ROWS; 
  float thresh = THRESH;

  if ((row >= rows2skip) && (row < height-2) && (col < width) && (row < firstSurfaceImgRow[col])) {

    int cnt = 0;
    int inc = 0;
    if (col == 0) inc = +1;
    if (col == (width-1)) inc = -1;
    if (inc){
            cnt = 3;
            if (threshImg[((row)*width)   +(col)]     == thresh) cnt++;
            if (threshImg[((row)*width)   +(col+inc)] == thresh) cnt++;
            if (threshImg[((row+1)*width) +(col)]     == thresh) cnt++;   
            if (threshImg[((row+1)*width) +(col+inc)] == thresh) cnt++;      
            if (threshImg[((row+2)*width) +(col)]     == thresh) cnt++;     
            if (threshImg[((row+2)*width) +(col+inc)] == thresh) cnt++;
            }
    else {
            // get nine neighbours - three in curr col, three each to left and right
            if (threshImg[((row)*width)   +(col-1)] == thresh) cnt++;
            if (threshImg[((row)*width)   +(col)]   == thresh) cnt++;
            if (threshImg[((row)*width)   +(col+1)] == thresh) cnt++;
            if (threshImg[((row+1)*width) +(col-1)] == thresh) cnt++;
            if (threshImg[((row+1)*width) +(col)]   == thresh) cnt++;   
            if (threshImg[((row+1)*width) +(col+1)] == thresh) cnt++;      
            if (threshImg[((row+2)*width) +(col-1)] == thresh) cnt++;
            if (threshImg[((row+2)*width) +(col)]   == thresh) cnt++;     
            if (threshImg[((row+2)*width) +(col+1)] == thresh) cnt++;
            }
    if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
            atomicMin(firstSurfaceImgRow + col, row);
            }
    }
}


int main(int argc, char *argv[]){

  float *threshImg, *h_threshImg, *firstSurfaceImg, *h_firstSurfaceImg;
  int *firstSurfaceImgRow, *h_firstSurfaceImgRow;
  int actualImHeight = 1024;
  int actualImWidth = 512;
  int row_set = 512;
  if (argc > 1){
    int my_val = atoi(argv[1]);
    if ((my_val > SKIP_ROWS) && (my_val < actualImHeight - 3)) row_set = my_val;
    }

  h_firstSurfaceImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_threshImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_firstSurfaceImgRow = (int *)malloc(actualImWidth*sizeof(int));
  cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImgRow, actualImWidth*sizeof(int));
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
  cudaMemset(firstSurfaceImg, 0, actualImHeight*actualImWidth*sizeof(float));

  for (int i = 0; i < actualImHeight*actualImWidth; i++) h_threshImg[i] = 0.0f;
  // insert "bright row" here
  for (int i = (row_set*actualImWidth); i < ((row_set+3)*actualImWidth); i++) h_threshImg[i] = THRESH;

  cudaMemcpy(threshImg, h_threshImg, actualImHeight*actualImWidth*sizeof(float), cudaMemcpyHostToDevice);


  dim3 grid(1,1024);
  //warm-up run
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
  cudaDeviceSynchronize();
  unsigned long long t2 = dtime_usec(0);
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t2 = dtime_usec(t2);
  cudaMemcpy(h_firstSurfaceImgRow, firstSurfaceImgRow, actualImWidth*sizeof(float), cudaMemcpyDeviceToHost);
  unsigned long long t1 = dtime_usec(0);
  firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  cudaMemcpy(h_firstSurfaceImg, firstSurfaceImg, actualImWidth*actualImHeight*sizeof(float), cudaMemcpyDeviceToHost); 

  printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);
  // validate results
  for (int i = 0; i < actualImWidth; i++) 
    if (h_firstSurfaceImgRow[i] < actualImHeight) 
      if (h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i] != THRESH)
        {printf("mismatch at %d, was %f, should be %d\n", i, h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i], THRESH); return 1;}
  return 0;
}
$ nvcc -arch=sm_35 -o t667 t667.cu
$ ./t667
t1 = 0.000978s, t2 = 0.000050s
$

注意事项：

上面的示例在图像的 row=512 处插入了一个“明亮的邻域”，因此在我的例子中提供了几乎 20 倍的中间加速因子 (K40c)。
为了简洁起见，我省略了proper cuda error checking。不过我推荐它。
任一内核的执行性能很大程度上取决于它是否首次运行。这可能与缓存和一般热身效果有关。因此，为了给出合理的结果，我首先运行我的内核作为额外的不定时预热运行。
我没有进行共享内存优化的一个原因是这个问题已经很小了，至少对于像 K40 这样的大型开普勒 GPU 来说，它几乎完全适合 L2 缓存（尤其是我的内核，因为它使用较小的输出数据结构。）鉴于此，共享内存可能不会带来太多性能提升。

编辑：我已经（再次）修改了代码，以便测试图像中插入明亮边界的行（行）可以作为命令行参数传递，并且我已经在 3 种不同的设备上测试了代码，对亮行使用了 3 种不同的设置：

execution time on:     K40    C2075    Quadro NVS 310
bright row =   15:   31/33    23/45       29/314
bright row =  512:  978/50  929/112     1232/805
bright row = 1000: 2031/71 2033/149    2418/1285

all times are microseconds (original kernel/optimized kernel)
CUDA 6.5, CentOS 6.2

【讨论】：

你说得对，我正在努力寻找第一条边。为此，我在每一列中找到有局部最大值的行。一旦我找到那一行，我就会跳出我的循环。那么，如果是这种情况，那么您的代码中的h_firstSurfaceImgRow 不应该初始化为cudaMemset(firstSurfaceImgRow, numeric_limits<int>::max(), actualImWidth*sizeof(int));，而不是将数组mem 设置为1？将最小行值与max_int_range 进行比较应该会产生图像中的第一条边。
我没有将数组设置为 1。我将其设置为 0x01010101（这就是 memset 的工作方式。）这个数字足够大，相当于 std::numeric_limits ::max()（出于此处的目的，最大值为 1024）。欢迎您使用 std::numeric_limits::max()。（但你不能用cudaMemset 做到这一点）我只是马虎。
Quadro K6000 的性能提高了不到原来内核运行时间的一半。接受并投票作为答案。
通过将线程数减少一半来进一步提高性能。在我的 K6000 上运行需要 ~2.125 ms。