【Posted】: 2015-02-15 07:09:18
【Question】:
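I am trying to optimise my surface detection kernel. Given a binary input image of 512 (w) x 1024 (h) pixels, I want to find the first bright surface in the image. The code I wrote launches 512 threads, one per column, and each thread searches down its column for the first pixel whose 3x3 neighbourhood is entirely bright. The code works fine, but at ~9.46 ms it is a bit slow, and I would like to make it run faster.
Edit 1: Run time dropped to less than half that of my original kernel. Robert's kernel runs in 4.032 ms on my Quadro K6000.
Edit 2: Managed to improve performance further by halving the number of threads. My kernel (a modified version of Robert's) now runs in 2.125 ms on my Quadro K6000.
The kernel is launched as: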
firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
I would like to use shared memory to improve the memory fetches. Any ideas on how to optimise this code?
__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {
    int col = threadIdx.x + (blockDim.x*blockIdx.x);
    int rows2skip = 10;
    float thresh = 1.0f;
    // thread index: (0 -> 511), one thread per image column
    if (col < width) {
        if( col == 0 ) { // first col - 0
            for (int row = 0 + rows2skip; row < height - 2; row++) { // skip the first rows2skip rows
                int cnt = 0;
                float neibs[6]; // not shared mem as it reduces speed
                // get six neighbours - three in same col, and three to the right
                neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
                neibs[1] = threshImg[((row)*width) +(col+1)];   if(neibs[1] == thresh) { cnt++; } // right
                neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
                neibs[3] = threshImg[((row+1)*width) +(col+1)]; if(neibs[3] == thresh) { cnt++; } // bottom right
                neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
                neibs[5] = threshImg[((row+2)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom right
                if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                    firstSurfaceImg[(row)*width + col] = 1.0f;
                    row = height; // terminate the loop for this column
                }
            }
        }
        else if ( col == (width-1) ) { // last col
            for (int row = 0 + rows2skip; row < height - 2; row++) {
                int cnt = 0;
                float neibs[6]; // not shared mem as it reduces speed
                // get six neighbours - three in same col, and three to the left
                neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
                neibs[1] = threshImg[((row)*width) +(col-1)];   if(neibs[1] == thresh) { cnt++; } // left
                neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
                neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; } // bottom left
                neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
                neibs[5] = threshImg[((row+2)*width) +(col-1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom left
                if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                    firstSurfaceImg[(row)*width + col] = 1.0f;
                    row = height; // terminate the loop for this column
                }
            }
        }
        // remaining threads are: (1 -> 510)
        else { // any col other than first or last column
            for (int row = 0 + rows2skip; row < height - 2; row++) {
                int cnt = 0;
                float neibs[9]; // not shared mem as it reduces speed
                // get nine neighbours - three in curr col, three each to left and right
                neibs[0] = threshImg[((row)*width) +(col-1)];   if(neibs[0] == thresh) { cnt++; }
                neibs[1] = threshImg[((row)*width) +(col)];     if(neibs[1] == thresh) { cnt++; }
                neibs[2] = threshImg[((row)*width) +(col+1)];   if(neibs[2] == thresh) { cnt++; }
                neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; }
                neibs[4] = threshImg[((row+1)*width) +(col)];   if(neibs[4] == thresh) { cnt++; }
                neibs[5] = threshImg[((row+1)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; }
                neibs[6] = threshImg[((row+2)*width) +(col-1)]; if(neibs[6] == thresh) { cnt++; }
                neibs[7] = threshImg[((row+2)*width) +(col)];   if(neibs[7] == thresh) { cnt++; }
                neibs[8] = threshImg[((row+2)*width) +(col+1)]; if(neibs[8] == thresh) { cnt++; }
                if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
                    firstSurfaceImg[(row)*width + col] = 1.0f;
                    row = height; // terminate the loop for this column
                }
            }
        }
    }
    __syncthreads();
}
【Comments】:
-
I can't answer this question, but make sure you are using the latest CUDA drivers.
-
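512 threads are not enough to keep a GPU busy. If you are interested in performance, you never want to launch kernels like this: <<<1,...>>> or like this: <<<...,1>>>. You have exposed parallelism across the width of your image; now it is time to expose parallelism across the height of your image as well. Get rid of your for loop and increase the grid to enough threads (you probably want to go to a 2D grid) so that each thread handles one pixel, instead of having each thread handle a column.
-
Once you have increased your thread count, you can make use of shared memory in a pretty straightforward way: load blocks of image data into shared memory, and have each thread do its neighbour loads and tests out of the shared memory area. I don't know why people put __syncthreads() at the end of a kernel. It serves no purpose there.
-
@RobertCrovella I think you must mean loading 2D square blocks of data into shared memory, like the matrix multiplication example in the CUDA programming guide?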
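The suggestions above (a 2D grid with one thread per pixel, plus a 2D tile staged in shared memory) might be sketched roughly as follows. This is not the code of anyone in this thread; it is a minimal illustration under stated assumptions. The names TILE, findFirstRow, markFirstRow, firstSurfaceTiled and the per-column firstRow buffer are all hypothetical. Because every pixel is now tested in parallel, the "first" row of each column is recovered with atomicMin instead of an early loop exit. For brevity the sketch only handles interior columns (1 to width-2) and assumes the output image has been zeroed by the caller; whether it is actually faster would have to be measured.

```cuda
#include <cassert>
#include <climits>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile edge: 16 x 16 = 256 threads per block

// Each thread tests the 3x3 window covering rows row..row+2 and columns
// col-1..col+1, the same window the question uses for interior columns.
// firstRow[col] must be pre-filled with INT_MAX.
__global__ void findFirstRow(const float *img, int *firstRow,
                             int height, int width, int rows2skip)
{
    // Tile plus halo: one extra column on each side, two extra rows below.
    __shared__ float tile[TILE + 2][TILE + 2];

    const int col0 = blockIdx.x * TILE;              // first column of this tile
    const int row0 = blockIdx.y * TILE + rows2skip;  // first row of this tile

    // Cooperative load of (TILE+2) x (TILE+2) values; out-of-image reads become 0.
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        const int ty = i / (TILE + 2), tx = i % (TILE + 2);
        const int r = row0 + ty, c = col0 + tx - 1;
        tile[ty][tx] = (r < height && c >= 0 && c < width) ? img[r * width + c] : 0.0f;
    }
    __syncthreads();  // useful here: threads read values loaded by other threads

    const int col = col0 + threadIdx.x;
    const int row = row0 + threadIdx.y;
    if (col < 1 || col >= width - 1 || row >= height - 2) return;

    bool allBright = true;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            allBright = allBright && (tile[threadIdx.y + dy][threadIdx.x + dx] == 1.0f);

    if (allBright) atomicMin(&firstRow[col], row);  // keep the topmost hit per column
}

// Writes 1.0f at the first detected row of every column that had a hit.
__global__ void markFirstRow(const int *firstRow, float *out, int width)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width && firstRow[col] != INT_MAX)
        out[firstRow[col] * width + col] = 1.0f;
}

// Host wrapper (hypothetical). d_img and d_out are device pointers; d_out is
// assumed to be zeroed by the caller.
cudaError_t firstSurfaceTiled(const float *d_img, float *d_out,
                              int height, int width, int rows2skip)
{
    assert(width > 2 && height > rows2skip + 2);
    int *d_firstRow = nullptr;
    cudaError_t err = cudaMalloc(&d_firstRow, width * sizeof(int));
    if (err != cudaSuccess) return err;

    std::vector<int> init(width, INT_MAX);
    cudaMemcpy(d_firstRow, init.data(), width * sizeof(int), cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((width + TILE - 1) / TILE, (height - rows2skip + TILE - 1) / TILE);
    findFirstRow<<<grid, block>>>(d_img, d_firstRow, height, width, rows2skip);
    markFirstRow<<<(width + 255) / 256, 256>>>(d_firstRow, d_out, width);

    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaFree(d_firstRow);
    return err;
}
```

For the image in the question the call would be firstSurfaceTiled(threshImg, firstSurfaceImg, actualImHeight, actualImWidth, 10). The trade-off is that every pixel below the first surface is still tested, so the gain over the column loop depends on how early the surface usually appears.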
-
By the way, this construct: threshImg[((row+2)*width) looks to me like it could index out of the bounds of threshImg. Maybe your for loop should stop at row < height - 2.
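The bounds concern can be checked with a little arithmetic. The helper below is a hypothetical illustration, not part of the original kernel: it computes the largest linear index the loops would read for a given loop limit, assuming the widest access is the neighbour at (row + 2, width - 1).

```cuda
#include <cassert>

// Largest linear index read when the row loop runs while row < rowLimit
// and the furthest neighbour touched is (row + 2, width - 1).
__host__ __device__ inline long long maxIndexRead(int rowLimit, int width)
{
    assert(rowLimit > 0 && width > 0);
    const long long lastRow = rowLimit - 1;      // last value the loop variable takes
    return (lastRow + 2) * width + (width - 1);  // bottom-right neighbour of that row
}
```

For the 512 x 1024 image, maxIndexRead(1024 - 2, 512) gives 524287, which is exactly the last valid index (512 * 1024 - 1). A loop limit of height instead of height - 2 would read past the end of the buffer.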