【Question title】: Increasing number of CPUs decreases performance, with cpu load constant and no communications
【Posted】: 2018-02-09 12:33:20
【Question】:

I have run into an interesting phenomenon that I cannot explain. I have not found an answer online either, since most posts deal with weak scaling and communication overhead.

Here is a small piece of code that illustrates the problem. It has been tested in different languages with similar results, hence the multiple tags.

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {

    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;

    /* synchronize all ranks so the timed region starts at the same moment */
    MPI_Barrier(MPI_COMM_WORLD);

    t = clock();

    /* pure compute load: an empty nested loop, no memory traffic,
       no communication */
    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            //nothing
        }
    }

    t = clock() - t;

    printf( " proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC );

    MPI_Finalize();

    return 0;

}

Now, as you can see, the only part that is timed is the loop. So with comparable CPUs, no hyperthreading, and enough RAM, increasing the number of CPUs should give exactly the same time.
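
One thing worth ruling out here is the timer itself: clock() measures per-process CPU time, while MPI_Wtime() measures wall-clock time, so if the two disagree the extra time is spent waiting rather than computing. Below is a minimal sketch of the same timed region using both timers; it keeps the loop bounds above and only adds a volatile accumulator so the compiler cannot discard the loop (this is an illustrative variant, not the original program).

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {

    MPI_Init(NULL, NULL);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    MPI_Barrier(MPI_COMM_WORLD);

    clock_t c0 = clock();      /* CPU time used by this process */
    double  w0 = MPI_Wtime();  /* wall-clock time */

    volatile long sink = 0;    /* keeps the loop from being optimized away */
    for (int i = 0; i < 10000000; i++) {
        for (int j = 0; j < 1000; j++) {
            sink += j;
        }
    }

    double  w1 = MPI_Wtime();
    clock_t c1 = clock();

    printf(" proc %d: cpu %f s, wall %f s\n",
           wrank, (float)(c1 - c0) / CLOCKS_PER_SEC, w1 - w0);

    MPI_Finalize();

    return 0;
}

If the CPU time stays flat while the wall time grows with the number of ranks, the slowdown comes from scheduling or throttling rather than from the loop itself.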

However, on my machine, which has 32 cores and 15 GiB of RAM,

mpirun -np 1 ./test 

gives

 proc 0 took 22.262777 seconds.

whereas

mpirun -np 20 ./test

gives

 proc 18 took 24.440767 seconds.
 proc 0 took 24.454365 seconds.
 proc 4 took 24.461191 seconds.
 proc 15 took 24.467632 seconds.
 proc 14 took 24.469728 seconds.
 proc 7 took 24.469809 seconds.
 proc 5 took 24.461639 seconds.
 proc 11 took 24.484224 seconds.
 proc 9 took 24.491638 seconds.
 proc 2 took 24.484953 seconds.
 proc 17 took 24.490984 seconds.
 proc 16 took 24.502146 seconds.
 proc 3 took 24.513380 seconds.
 proc 1 took 24.541555 seconds.
 proc 8 took 24.539808 seconds.
 proc 13 took 24.540005 seconds.
 proc 12 took 24.556068 seconds.
 proc 10 took 24.528328 seconds.
 proc 19 took 24.585297 seconds.
 proc 6 took 24.611254 seconds.

For other numbers of CPUs, the values fall somewhere in between.

htop also shows an increase in RAM consumption (VIRT is about 100M for 1 core and about 300M for 20 cores). Although that might just reflect the size of the MPI communicator?

Finally, the effect definitely scales with the problem size (so it is not a communication overhead adding a constant delay regardless of loop size). Indeed, lowering imax to 10 000 makes the walltimes similar.

1 core:

 proc 0 took 0.028439 seconds.

20 cores:

 proc 1 took 0.027880 seconds.
 proc 12 took 0.027880 seconds.
 proc 8 took 0.028024 seconds.
 proc 16 took 0.028135 seconds.
 proc 17 took 0.028094 seconds.
 proc 19 took 0.028098 seconds.
 proc 7 took 0.028265 seconds.
 proc 9 took 0.028051 seconds.
 proc 13 took 0.028259 seconds.
 proc 18 took 0.028274 seconds.
 proc 5 took 0.028087 seconds.
 proc 6 took 0.028032 seconds.
 proc 14 took 0.028385 seconds.
 proc 15 took 0.028429 seconds.
 proc 0 took 0.028379 seconds.
 proc 2 took 0.028367 seconds.
 proc 3 took 0.028291 seconds.
 proc 4 took 0.028419 seconds.
 proc 10 took 0.028419 seconds.
 proc 11 took 0.028404 seconds.

This has been tried on several machines with similar results. Maybe we are missing something very simple.

Thanks for your help!

【Comments】:

  • Maybe scheduling multiple cores/threads takes a considerable amount of time. Each thread needs to do more work than it takes the scheduler to schedule it. Also, if your code is memory-bandwidth bound, performance will drop.
  • Once again: C and C++ are two very different languages. Unless you have a good reason, you should not tag both! By the way, what does Fortran have to do with this?
  • @muXXmit2X I added several tags because this was tested in different languages with similar results. I should have mentioned that, though.
  • Have you considered the effect on the L3 cache? As more and more processors compete for a limited cache, the number of reads from RAM will increase.
  • Are you binding the MPI tasks to cores? Are you using more MPI tasks than cores? (a binding example follows right after this list)
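
Regarding the last comment: with Open MPI, binding and the resulting placement can be checked straight from the command line. The flags below are Open MPI's; other MPI launchers spell them differently.

mpirun --bind-to core --report-bindings -np 20 ./test

If ranks are not pinned, the kernel is free to migrate them between cores during the run, which can add noise to this kind of measurement.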

Tags: c++ c fortran mpi


【Solution 1】:

Processors have turbo frequencies that are limited by temperature.

Modern processors are limited by their thermal design power (TDP). While the processor is cool, a single core can speed up to its turbo multiplier. When it is hot, or when several non-idle cores are running, the cores slow down to their guaranteed base speed. The difference between base and turbo speeds is often around 400 MHz. AVX or FMA3 can slow a core down even further, below the base speed.
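
A rough way to confirm this on Linux (assuming the kernel exposes the current frequency in /proc/cpuinfo, which most distributions do) is to watch the effective core clocks while the benchmark runs; if they drop once all 20 ranks are busy, turbo scaling explains the slowdown:

watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"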

【Discussion】:

  • I find it unlikely that thermal protection kicking in is responsible for the reported slowdown; the processor only seems to run for about 25 seconds.
  • One way to check would be to disable turbo mode and run the test again (a sketch of how to do that on Linux follows below).
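
A sketch of that check on a Linux box using the intel_pstate driver (the sysfs path is driver-specific and assumed here; acpi-cpufreq exposes a different knob):

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
mpirun -np 1 ./test
mpirun -np 20 ./test
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

With turbo disabled, the single-rank and 20-rank timings should converge if turbo scaling is indeed the cause.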