Reorganizing nested loops for multithreading: answers

【Question title】: Reorganizing nested loops for multithreading
【Posted】: 2021-12-04 00:25:34
【Question description】:

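I am trying to rewrite the main loop of a physics simulation and spread the workload across more threads. It calls dostuff on every unique pair of indices, like this: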

for (int i = 0; i < n - 1; ++i)
{
    for (int j = i + 1; j < n; ++j)
    {
        dostuff(i, j);
    }
}

I came up with two options:

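//#1
//sqrt is implemented as binary search on integers, floors the result
//x is 64-bit: n * (n - 1) / 2 does not fit in a 32-bit int for n near 10^6
for (long long x = 0; x < (long long)n * (n - 1) / 2; ++x)
{
    int i = (1 + sqrt(1 + 8 * x)) / 2;
    int j = x - (long long)i * (i - 1) / 2;
    dostuff(i, j);
}
//#2
//x is 64-bit: n * n does not fit in a 32-bit int for n near 10^6
for (long long x = 0; x < (long long)n * n; ++x)
{
    int i = x % n;
    int j = x / n;
    if (i < j)
        dostuff(i, j);
}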

For each option there is a corresponding thread loop that uses a shared atomic counter:

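//#1
//assumes counter is a std::atomic<long long>; the bound does not fit in a 32-bit int for n near 10^6
//the assignment is parenthesized, otherwise x receives the result of the comparison (0 or 1)
long long x;
while ((x = counter.fetch_add(1)) < (long long)n * (n - 1) / 2)
{
    int i = (1 + sqrt(1 + 8 * x)) / 2;
    int j = x - (long long)i * (i - 1) / 2;
    dostuff(i, j);
}
//#2
long long x;
while ((x = counter.fetch_add(1)) < (long long)n * n)
{
    int i = x % n;
    int j = x / n;
    if (i < j)
        dostuff(i, j);
}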

My question is: for n < 10^6, what is the best way to share the main loop's workload between threads?

Edit:

//dostuff
Element& a = elements[i];
Element& b = elements[j];
glm::dvec3 r = b.getPosition() - a.getPosition();
double rv = glm::length(r);
double base = G / (rv * rv);
glm::dvec3 dir = glm::normalize(r);
glm::dvec3 bd = dir * base;
accelerations[i] += bd * b.getMass();
accelerations[j] -= bd * a.getMass();
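
The body of dostuff can also be written without glm, which makes the arithmetic explicit. The sketch below uses made-up stand-ins (`Vec3`, `Body`, `apply_pair_gravity`) for `glm::dvec3`, `Element` and dostuff, and folds `dir * base` into a single factor, since normalize(r) * G / |r|^2 equals r * G / |r|^3. That needs one square root per pair instead of two; it is a rearrangement for illustration, not the code from the question.

```cpp
#include <cmath>
#include <vector>

// Minimal stand-in for glm::dvec3 (hypothetical, for illustration only).
struct Vec3
{
    double x, y, z;
};

// Minimal stand-in for Element (hypothetical, for illustration only).
struct Body
{
    Vec3 position;
    double mass;
};

// Adds the mutual gravitational acceleration of bodies i and j (i != j,
// distinct positions) to accelerations[i] and accelerations[j].
void apply_pair_gravity(const std::vector<Body>& bodies,
                        std::vector<Vec3>& accelerations,
                        int i, int j, double G)
{
    const Body& a = bodies[i];
    const Body& b = bodies[j];
    const Vec3 r{b.position.x - a.position.x,
                 b.position.y - a.position.y,
                 b.position.z - a.position.z};
    const double dist2 = r.x * r.x + r.y * r.y + r.z * r.z;
    const double scale = G / (dist2 * std::sqrt(dist2)); // G / |r|^3

    // Body i is pulled towards j, body j is pulled towards i.
    accelerations[i].x += r.x * scale * b.mass;
    accelerations[i].y += r.y * scale * b.mass;
    accelerations[i].z += r.z * scale * b.mass;
    accelerations[j].x -= r.x * scale * a.mass;
    accelerations[j].y -= r.y * scale * a.mass;
    accelerations[j].z -= r.z * scale * a.mass;
}
```

Each call writes to two entries of `accelerations`, so two threads that handle pairs sharing an index update the same entry unless the work is split so that this cannot happen.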

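【Question comments】:

I would start with std::for_each(std::execution::par_unseq, ...) (on one or both loops), or something similar from TBB.
How to split the workload depends heavily on your hardware and OS, and on how much work doStuff actually does.
doStuff does about 30 double multiplications, 2 sqrt calls, and 2 vector reads and writes.
@Niik What OS and compiler are you using? There are vendor-specific concurrency libraries.
Since you are accessing vectors, try to split the workload so that each thread works on the same "neighborhood" in memory. That reduces cache misses, which have a large impact on performance.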
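
One way to read the locality suggestion in the last comment is to walk the index pairs tile by tile, so that a unit of work only touches two short ranges of elements. The sketch below is an assumption about what such a split could look like: the function name and the `tile` parameter are made up here, and it runs on a single thread. A threaded version could hand out whole tiles instead of single pairs.

```cpp
#include <algorithm>

// Visits every pair (i, j) with 0 <= i < j < n exactly once, one square tile
// of the index space at a time. tile (> 0) is a hypothetical tuning parameter.
template <typename Work>
void for_each_pair_tiled(int n, int tile, Work work)
{
    for (int i0 = 0; i0 < n; i0 += tile)
    {
        const int i1 = std::min(i0 + tile, n);
        // Only tiles on or above the diagonal contain pairs with i < j.
        for (int j0 = i0; j0 < n; j0 += tile)
        {
            const int j1 = std::min(j0 + tile, n);
            // Inside this tile only elements [i0, i1) and [j0, j1) are touched.
            for (int i = i0; i < i1; ++i)
                for (int j = std::max(j0, i + 1); j < j1; ++j)
                    work(i, j);
        }
    }
}
```

A tile size that lets both element ranges stay in cache is the kind of value one would have to find by measurement on the target machine.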

Tags: c++ multithreading

【Solution 1】:

Your work is a triangle. You want to divide the triangle into k distinct pieces.

If k is a power of 2, you can do it like this:

a
a a
b c d
b c d d

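Each region is about the same size: in the drawing, a and d cover 3 cells each and b and c cover 2 each, and the relative difference shrinks as the triangle grows.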
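
One possible reading of the drawing for k = 4, sketched under assumptions: a is the upper triangle, b and c are the two halves of the square below it, and d is the lower-right triangle. Rows are i, columns are j, and only pairs with j < i are kept (the drawing includes the diagonal, the pair loop does not). The names `Region`, `split_triangle` and `for_each_pair_in` are made up for this illustration.

```cpp
#include <algorithm>
#include <array>

// A block of pairs (i, j): rows i in [row_begin, row_end), columns j in
// [col_begin, col_end). If triangular is true, j is also limited to j < i.
struct Region
{
    int row_begin, row_end;
    int col_begin, col_end;
    bool triangular;
};

// Splits the pairs 0 <= j < i < n into the four regions a, b, c, d.
std::array<Region, 4> split_triangle(int n)
{
    const int h = n / 2; // first row of the lower half
    const int q = h / 2; // column at which the square is cut in two
    return {{
        {0, h, 0, h, true},  // a: upper triangle
        {h, n, 0, q, false}, // b: left half of the square
        {h, n, q, h, false}, // c: right half of the square
        {h, n, h, n, true},  // d: lower-right triangle
    }};
}

// Calls work(i, j) for every pair in the region; one thread per region.
template <typename Work>
void for_each_pair_in(const Region& r, Work work)
{
    for (int i = r.row_begin; i < r.row_end; ++i)
    {
        const int col_end = r.triangular ? std::min(i, r.col_end) : r.col_end;
        for (int j = r.col_begin; j < col_end; ++j)
            work(i, j);
    }
}
```

For n = 8 the four regions hold 6, 8, 8 and 6 pairs, and for n = 1024 they hold 130816, 131072, 131072 and 130816, so the balance is approximate but improves with n. For larger powers of 2, the same cut could presumably be applied again inside each region.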

【Discussion】: