计算混合实复矩阵向量积的最快方法是什么？答案

【问题标题】：What is the fastest way to compute mixed real-complex matrix-vector products?计算混合实复矩阵向量积的最快方法是什么？
【发布时间】：2020-03-23 18:08:58
【问题描述】：

我面临着尽可能快地计算矩阵向量乘积的问题，其中矩阵是严格实数且向量是复数。更准确地说，实矩阵是（反）对称的，并且具有非零密集块的稀疏结构。

到目前为止，我的策略是将向量分成实部和虚部，并为每个密集块计算实部和虚部的矩阵向量乘积。由于矩阵是（反）对称的，我打算同时计算一个块的乘积及其转置，所以我可以重新使用矩阵所在的缓存线。所以对于每个块，我计算 4 个矩阵- 块的向量积及其实部和虚部的转置。

我为单个区块计算这 4 个产品的代码最终如下所示：

#define no_alias __restrict__

template <typename VecType, typename MatType>
void trans_mul( const VecType* const no_alias re_in,
                VecType* const no_alias re_out, 
                const VecType* const no_alias im_in,
                VecType* const no_alias im_out, 
                const VecType* const no_alias re_tin,
                VecType* const no_alias re_tout,
                const VecType* const no_alias im_tin,
                VecType* const no_alias im_tout,
                const MatType* no_alias mat, // the matrix block
                const int rows, 
                const int cols)
{
    for(int j = 0; j < cols; ++j) {
        for(int i = 0; i < rows; ++i) {
            const auto m = *mat++;        // this is mat[i, j]
            re_tout[j]  += m * re_tin[i]; // transposed
            im_tout[j]  += m * im_tin[i]; // transposed
            re_out[i]   -= m * re_in[j];
            im_out[i]   -= m * im_in[j];
        }
    }
}

典型的矩阵大小为 10^2。我使用带有-Ofast -march=native 的GCC 9.2.1 编译我的代码。从assembly output可以看出编译器是自动向量化并使用SIMD指令。

我正在与用 Fortran 编写的类似代码竞争，它的运行速度仍然快了大约 25%。当然，我的代码非常幼稚，但我仍然想不出比这更快的东西，因为积极的优化似乎非常有效。我还尝试使用四个cblas_dgemv 调用，但这比我幼稚的方法要慢得多。我还有什么可以做的吗，或者有什么有用的 BLAS 例程可以适合我的情况？

【问题讨论】：

要非常小心 -Ofast 它违反了一些语言规则以提高您的速度，因此某些语言保证不再成立，并且在极端情况下您可能会遇到奇怪的错误。就我个人而言，我避免它并坚持对我的代码的-O2 与-O3 性能进行基准测试。我宁愿在一些微小的（通常是微不足道的甚至不存在的）加速上具有可预测的行为。
@JesperJuhl 感谢您的评论，但我的目标是便携性和速度，即使我在结果上失去了一些准确性（我确实这样做了）。 -O3 和 -Ofast 之间的加速差异在我的情况下被证明是非常显着的，大约快 125%。
-Ofast 的结果是但可移植的。正如我所说，它放弃了标准强制行为以提高速度，因此您不能保证与其他编译器或其他平台上的相同编译器相同的结果。
计算的矩阵/向量的典型大小是多少？请阅读this 以检查您是否可以在此处使用-Ofast。
@JérômeRichard 他们可能在 100-10k 之间。

标签： c++ caching optimization vectorization matrix-multiplication

【解决方案1】：

对于相当大的矩阵（例如 >=1k），您可以使用寄存器阻塞来提高性能（与算术运算相比，减少了内存加载/存储的数量）。对于小矩阵，很难比原代码做得更好。

这是带有寄存器阻塞的结果代码：

#define no_alias __restrict__
#define mini(a, b) (((a)<(b)) ? (a) : (b))

template <typename VecType, typename MatType>
void trans_mul_v2( const VecType* const no_alias re_in,
                VecType* const no_alias re_out, 
                const VecType* const no_alias im_in,
                VecType* const no_alias im_out, 
                const VecType* const no_alias re_tin,
                VecType* const no_alias re_tout,
                const VecType* const no_alias im_tin,
                VecType* const no_alias im_tout,
                const MatType* no_alias mat, // the matrix block
                const int rows, 
                const int cols)
{
    // Block size (tuned for Clang/GCC on Intel Skylake processors)
    const int si = 16;
    const int sj = 8;

    for(int bj = 0; bj < cols; bj+=sj) {
        for(int bi = 0; bi < rows; bi+=si) {
            if(bi+si <= rows && bj+sj <= cols)
            {
                // The underlying loops are expected to be unrolled by the compiler
                for(int j = bj; j < bj+sj; ++j) {
                    for(int i = bi; i < bi+si; ++i) {
                        const auto m = mat[j*rows+i]; // Assume a column major ordering
                        re_tout[j]  += m * re_tin[i];
                        im_tout[j]  += m * im_tin[i];
                        re_out[i]   -= m * re_in[j];
                        im_out[i]   -= m * im_in[j];
                    }
                }
            }
            else
            {
                // General case (borders)
                for(int j = bj; j < mini(bj+sj,cols); ++j) {
                    for(int i = bi; i < mini(bi+si,rows); ++i) {
                        const auto m = mat[j*rows+i];
                        re_tout[j]  += m * re_tin[i];
                        im_tout[j]  += m * im_tin[i];
                        re_out[i]   -= m * re_in[j];
                        im_out[i]   -= m * im_in[j];
                    }
                }
            }
        }
    }
}

注意si 和sj 的值对执行时间有很大影响。最佳值取决于编译器和底层架构。您可能应该针对目标机器调整它（如果您想要性能可移植性，尽管性能可能不是最佳，但请保持它们较小）。

以下是结果（GCC 9 使用双精度类型）：

With a row=100 and cols=100:
trans_mul_v1: 2.438 us
trans_mul_v2: 2.842 us

With a row=1k and cols=1k:
trans_mul_v1: 452 us
trans_mul_v2: 296 us

With a row=10k and cols=10k:
trans_mul_v1: 71.2 ms
trans_mul_v2: 35.9 ms

【讨论】：