使用 GCC 编译器优化多线程时，Eigen 3.3 共轭梯度变慢答案

【问题标题】：Eigen 3.3 Conjugate Gradient is slower when multi-threaded with GCC compiler optimization使用 GCC 编译器优化多线程时，Eigen 3.3 共轭梯度变慢
【发布时间】：2017-05-06 20:08:11
【问题描述】：

我一直在 Eigen 3.2 中使用 ConjugateGradient 求解器，并决定尝试升级到 Eigen 3.3.3，希望从新的多线程功能中受益。

遗憾的是，当我使用 GCC 4.8.4 启用 -fopenmp 时，求解器似乎变慢了 (~10%)。查看xosview，我看到所有8个cpu都在使用，但是性能更慢......

经过一些测试，我发现如果我禁用编译器优化（使用-O0 而不是-O3），那么-fopenmp 确实可以将求解器加速约50%。

当然，仅仅为了从多线程中受益而禁用优化并不值得，因为总体而言这会更慢。

根据https://stackoverflow.com/a/42135567/7974125 的建议，我将存储完整的稀疏矩阵并将Lower|Upper 作为UpLo 参数传递。

我还尝试了 3 个预条件子中的每一个，还尝试使用 RowMajor 矩阵，但无济于事。

还有什么可以尝试获得多线程和编译器优化的全部好处吗？

我无法发布我的实际代码，但这是使用来自Eigen's documentation 的拉普拉斯示例的快速测试，除了一些更改为使用ConjugateGradient 而不是SimplicialCholesky。（这两个求解器都适用于 SPD 矩阵。）

#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <iostream>
#include <vector>

using namespace Eigen;
using namespace std;

// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;

// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
                       VectorXd& b, const VectorXd& boundary)
{
  int n = int(boundary.size());
  int id1 = i+j*n;
        if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
  else  if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
  else  coeffs.push_back(T(id,id1,w));              // unknown coefficient
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
  b.setZero();
  ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
  for(int j=0; j<n; ++j)
  {
    for(int i=0; i<n; ++i)
    {
      int id = i+j*n;
      insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j,    4, coefficients, b, boundary);
    }
  }
}

int main()
{
  int n = 300;  // size of the image
  int m = n*n;  // number of unknowns (=number of pixels)
  // Assembly:
  vector<T> coefficients;          // list of non-zeros coefficients
  VectorXd b(m);                   // the right hand side-vector resulting from the constraints
  buildProblem(coefficients, b, n);
  SpMat A(m,m);
  A.setFromTriplets(coefficients.begin(), coefficients.end());
  // Solving:
  // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
  BenchTimer t;
  t.reset(); t.start();
  ConjugateGradient<SpMat, Lower|Upper> solver(A);
  VectorXd x = solver.solve(b);         // use the factorization to solve for the given right hand side
  t.stop();
  cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER
  return 0;
}

结果输出：

// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473

// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621

// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272

// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809

【问题讨论】：

你必须尝试使用 openmp 来获得不同的线程数，使用 omp_set_num_threads 到 4。也许内存是你的瓶颈。启动 8 个线程，它们会争夺访问内存并降低性能。

标签： gcc openmp eigen eigen3

【解决方案1】：

您的问题太小，不能指望多线程带来任何好处。稀疏矩阵预计至少大一个数量级。在这种情况下，应该调整 Eigen 的代码以减少线程数。

此外，我猜您只有 4 个物理内核，因此使用 OMP_NUM_THREADS=4 ./cg 运行可能会有所帮助。

【讨论】：

如果你的意思是 4 个超线程内核，也可以使用 omp_places=cores （但这不是在 g++ windows 上实现的）
澄清一下，我的问题中的示例代码使用的是 90000*90000 矩阵。 int n = 300; // size of the image 这行可能具有误导性。未知数的实际数量为 n*n 个像素。 90000*90000还是太小了吗？
如果一个只运行 1 秒并且根据 cg 的顺序性质具有有限并行性的任务没有从并行性中受益，这可能并不奇怪。
对，90000*90000 已经够大了。在这里，使用您的代码，我得到 0.75s 没有 openmp 和 0.45s 与 openmp。我没有禁用 turboboost，因此顺序版本以更高的时钟频率运行。