Smooth convolution in CUDA C++: answers

【Question title】: Smooth convolution in CUDA C++
【Posted】: 2021-12-02 19:02:24
【Question description】:

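I'm new to CUDA and I'm trying to implement a smoothing convolution on an image. This is what I have so far, but the result is wrong. I'm not sure whether my offset arithmetic is correct. Any help?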


__global__ void smooth(unsigned char* device_out_image, float kernel_size, unsigned char* device_input_imag, int height, int width)
{
    int pos_x = threadIdx.x + blockIdx.x * blockDim.x;//x coordinate of pixel
    int pos_y = threadIdx.y + blockIdx.y * blockDim.y;//y coordinate of pixel

    if (pos_x < width && pos_y < height)
    {
        unsigned char r = device_input_imag[pos_y * width + pos_x];//absolute pixel position
        unsigned char g = device_input_imag[(height + pos_y) * width + pos_x];
        unsigned char b = device_input_imag[(height * 2 + pos_y) * width + pos_x];
        //also mix value with the intensity instead of the range x
        float sumR = float(0.0f);
        float sumG = float(0.0f);
        float sumB = float(0.0f);
        for (int i = (-1 * 15); i <= 15; i++)
            for (int j = (-1 *15); j <= 15; j++)
            {
                if (pos_x + j > 0 && pos_y + i > 0 && pos_x + j <= width && pos_y + i <= height)
                {
                    sumR += (float)device_input_imag[(pos_y + i) * width + (pos_x + j)]/255.0;
                    sumG += (float)device_input_imag[(height + (pos_y + i)) * width + (pos_x + j)]/255.0;
                    sumB += (float)device_input_imag[(height * 2 + (pos_y + i)) * width + (pos_x + j)]/255.0;
                }   
            }
        sumR = sumR / (15 * 15);
        sumG = sumG / (15 * 15);
        sumB = sumB / (15 * 15);
        device_out_image[pos_y * width + pos_x] = (unsigned char)(sumR * 255.0);
        device_out_image[(height + pos_y) * width + pos_x] = (unsigned char)(sumG * 255.0) ;
        device_out_image[(height * 2 + pos_y) * width + pos_x] = (unsigned char)(sumB *255.0 );
        if (device_out_image[pos_y * width + pos_x] > 255)
            device_out_image[pos_y * width + pos_x] = 255;
        if (device_out_image[(height + pos_y) * width + pos_x] > 255)
            device_out_image[(height + pos_y) * width + pos_x] = 255;
        if (device_out_image[(height * 2 + pos_y) * width + pos_x] > 255)
            device_out_image[(height * 2 + pos_y) * width + pos_x] = 255;   
    }
}

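【Question comments】:

Please try to explain your problem more clearly. You can edit your question.
Your loops run from -15 to 15, so you have 31*31 values. Also check the upper-bound conditions that Peter mentions.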

Tags: cuda

【Solution 1】:

Try the following debugging steps:

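Can you set the output image to a gray gradient, where all 3 channels are equal and the value is the x coordinate? This verifies that your interpretation of the output image, the memcpy, the channel layout, the sizes, and so on are all correct. If the result is wrong, keep digging, and use what it looks like to help diagnose the problem. For example, if you see a gradient but it is skewed, you probably have the wrong row stride. If the result is not gray, you have misunderstood where r, g, and b are stored. Back up a step and set only one channel at a time.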

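Can you copy the input image to the output image using the same kernel framework you already have? That is, comment out the loops and just set each output value to the corresponding input value. This verifies that your interpretation of the input, the memcpy, and so on are correct.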

Go to a kernel of size 3x3. Then go back to the full-size kernel.

Some other tips:

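You don't need floating-point accumulators. Because the image is made up of 8-bit values, you will never overflow a 32-bit integer accumulator, even with a larger kernel (a 32 x 32 window of 8-bit values sums to less than 32 * 32 * 256 = 2^18, so it fits in 18 bits).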

Double-check your > versus >=. 0 is a valid index, so you want to test idx >= 0; width is not, so you need to test idx <= width - 1 or, more idiomatically, idx < width.

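Clamp the output range before assigning to the memory location. The optimizer will almost certainly sort this out, but if you clamp the temporary sumR/sumG/sumB values, your code will also look simpler and be easier to verify. A comparison against 255 made after the store can never succeed, because an unsigned char cannot hold a larger value.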

【Comments】: