【Title】Fast convolution algorithm has different outputs compared to regular convolution algorithm?
【Posted】2020-12-17 17:11:23
【Question】

I am trying to speed up the forward pass of a convolution layer by using a technique that converts my image data into column vectors, essentially turning my convolution problem into a matrix multiplication problem.

[Idea from https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/]

I first implemented an im2col function based on Caffe's official GitHub repository:

void im2col_cpu(float* data_im, const int channels,
    const int height, const int width, const int kernel_h, const int kernel_w,
    const int pad,const int stride,const int dilation,float* data_col) {
  int dil_kernel_h = (kernel_h - 1) * dilation + 1;
  int dil_kernel_w = (kernel_w - 1) * dilation + 1;
  int height_col = (height + 2 * pad - dil_kernel_h) / stride + 1;
  int width_col = (width + 2 * pad - dil_kernel_w) / stride + 1;
  int channels_col = channels * kernel_h * kernel_w;
  
  #pragma omp parallel for
  for (int c = 0; c < channels_col; ++c) {
    int w_offset = c % kernel_w;
    int h_offset = (c / kernel_w) % kernel_h;
    int c_im = c / kernel_h / kernel_w;

    const int hc0 = h_offset * dilation - pad;
    const int wc0 = w_offset * dilation - pad;
    for (int h = 0; h < height_col; ++h) {
      int h_pad = h * stride + hc0;

      const int row_offset = (c * height_col + h) * width_col;
      const int srow_offset = (c_im * height + h_pad) * width;


      for (int w = 0; w < width_col; ++w) {
        int w_pad = w * stride + wc0;
        // Padded positions must be written as 0; col is malloc'd, not zeroed
        if ((h_pad >= 0) && (h_pad < height) && (w_pad >= 0) && (w_pad < width))
          *(data_col + row_offset + w) = *(data_im + srow_offset + w_pad);
        else
          *(data_col + row_offset + w) = 0.0f;
      }
    }
  }
}

I then multiply its output with custom matrix multiplication code:

void mat_mul(float *A, float *B, float *C, int M, int N, int K, bool has_bias) {
  int i, j, k;

  if (!has_bias)  init(C, M, N); //init() converts C into a 0 matrix
  
  # pragma omp parallel for private(i, j, k)
  for (i = 0; i < M; ++i) {
    for (k = 0; k < K; ++k) {
      float * ptr_c = &C[i * N];
      float * ptr_b = &B[k * N];
      float * ptr_a = &A[i * K + k];
      for (j = 0; j < N; ++j) {
        *(ptr_c+j) += *ptr_a * *(ptr_b + j);
      }
      
    }
  }
}

My convolution code therefore looks like this:

void Conv2d(Tensor input, Tensor weight, Tensor bias, Tensor output, int stride, int pad, int dilation, bool has_bias) {
  int C = input.shape[0], H = input.shape[1], W = input.shape[2];
  int K = weight.shape[0], R = weight.shape[2], S = weight.shape[3];
  int OH = output.shape[1], OW = output.shape[2];
  CHECK_ERROR(OH == (H + 2 * pad - dilation * (R - 1) - 1) / stride + 1, "Output height mismatch");
  CHECK_ERROR(OW == (W + 2 * pad - dilation * (S - 1) - 1) / stride + 1, "Output width mismatch");
  CHECK_ERROR(weight.shape[1] == C && (!has_bias || bias.shape[0] == K) && output.shape[0] == K, "Channel size mismatch");

  float* col = (float *)malloc(sizeof(float) * (C * R * S * H * W));
  im2col_cpu(input.buf, C, H, W, R, S, pad, stride, dilation, col);
  mat_mul(weight.buf, col, output.buf, K, OH * OW, R * S * C, has_bias);

  free(col);

}

However, I am stuck with the regular convolution that uses the standard (very slow) algorithm, because the output from the matrix multiplication approach does not match the output of the following:

void Conv2d(Tensor input, Tensor weight, Tensor bias, Tensor output, int stride, int pad, int dilation, bool has_bias) {
  int C = input.shape[0], H = input.shape[1], W = input.shape[2];
  int K = weight.shape[0], R = weight.shape[2], S = weight.shape[3];
  int OH = output.shape[1], OW = output.shape[2];
  CHECK_ERROR(OH == (H + 2 * pad - dilation * (R - 1) - 1) / stride + 1, "Output height mismatch");
  CHECK_ERROR(OW == (W + 2 * pad - dilation * (S - 1) - 1) / stride + 1, "Output width mismatch");
  CHECK_ERROR(weight.shape[1] == C && (!has_bias || bias.shape[0] == K) && output.shape[0] == K, "Channel size mismatch");

  for (int k = 0; k < K; ++k) {
    for (int oh = 0; oh < OH; ++oh) {
      for (int ow = 0; ow < OW; ++ow) {
        float o = has_bias ? bias.buf[k] : 0;
        for (int c = 0; c < C; ++c) {
          for (int r = 0; r < R; ++r) {
            for (int s = 0; s < S; ++s) {
              int h = oh * stride - pad + r * dilation;
              int w = ow * stride - pad + s * dilation;
              if (h < 0 || h >= H || w < 0 || w >= W) continue;
              float i = input.buf[c * H * W + h * W + w];
              float f = weight.buf[k * C * R * S + c * R * S + r * S + s];
              o += i * f;
            }
          }
        }
        output.buf[k * OH * OW + oh * OW + ow] = o;
      }
    }
  }
}

Any ideas on why my matrix multiplication code does not work?

【Comments】

  • Unrelated: many people find ptr_c[j] easier to read than *(ptr_c+j) (it is also less verbose).
  • Oh, okay, thanks. I used to think *(ptr_c + i) made sequential memory access faster than using brackets, ptr_c[i], and it became a habit.
  • They compile to the same thing, so they are equally fast :-)
  • Yes! I guess I need to improve the readability of my code.
  • Off-topic, but... why use a brute-force approach like this instead of an FFT-based algorithm, e.g. fftw3?

【Tags】c++ optimization conv-neural-network openmp convolution


【Solution 1】

Oh, I found the problem. In my original code, I set the bias as

float o = has_bias ? bias.buf[k] : 0;

where k denotes the k-th of the K filters. However, in my mat_mul code I naively assumed that *(ptr_c+j) += *ptr_a * *(ptr_b + j); would add the right amount of bias to the final output. In fact, when has_bias was true, C was never initialized at all, so the bias was never added.

I changed the code to:

void mat_mul(float *A, float *B, float *C, Tensor bias, int M, int N, int K, bool has_bias) {
  # pragma omp parallel for
  for (int i = 0; i < M; ++i) {
    int h_offset = i * N;
    for (int j = 0; j < N; ++j) {
        C[h_offset + j] = has_bias ? bias.buf[i] : 0;
    }
  }
  
  int i, j, k;
  # pragma omp parallel for private(i, j, k)
  for (i = 0; i < M; ++i) {
    for (k = 0; k < K; ++k) {
      int ptr_c = i * N;
      int ptr_b = k * N;
      int ptr_a = i * K + k;
      for (j = 0; j < N; ++j) {
        C[ptr_c+j] += A[ptr_a] * B[ptr_b + j];
      }
      
    }
  }
}

This adds the same amount of bias as the original code.

【Comments】

  • Note that even in a correct parallel program, the numerical output of a parallel run often differs slightly (or significantly, depending on the algorithm's numerical stability) from that of the sequential version. This is due to the non-associativity of floating-point operations and to possible loss of internal precision when data moves between threads.