【Question title】: Increasing number of CPUs decreases performance, with cpu load constant and no communications
【Posted】: 2018-02-09 12:33:20
【Question】:

I have run into an interesting phenomenon that I cannot explain. I have not found an answer online either, since most posts deal with weak scaling and communication overhead.

Here is a small piece of code that illustrates the problem. It has been tested in different languages with similar results, hence the multiple tags.

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {

    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;

    /* synchronize all ranks so the timed region starts at the same moment */
    MPI_Barrier(MPI_COMM_WORLD);

    t = clock();

    /* pure compute load: an empty nested loop, no memory traffic,
       no communication */
    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            //nothing
        }
    }

    t = clock() - t;

    printf( " proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC );

    MPI_Finalize();

    return 0;

}

Now, as you can see, the only part that is timed is the loop. So with comparable CPUs, no hyperthreading, and enough RAM, increasing the number of CPUs should give exactly the same time.
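
One thing worth ruling out here is the timer itself: clock() measures per-process CPU time, while MPI_Wtime() measures wall-clock time, so if the two disagree the extra time is spent waiting rather than computing. Below is a minimal sketch of the same timed region using both timers; it keeps the loop bounds above and only adds a volatile accumulator so the compiler cannot discard the loop (this is an illustrative variant, not the original program).

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {

    MPI_Init(NULL, NULL);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    MPI_Barrier(MPI_COMM_WORLD);

    clock_t c0 = clock();      /* CPU time used by this process */
    double  w0 = MPI_Wtime();  /* wall-clock time */

    volatile long sink = 0;    /* keeps the loop from being optimized away */
    for (int i = 0; i < 10000000; i++) {
        for (int j = 0; j < 1000; j++) {
            sink += j;
        }
    }

    double  w1 = MPI_Wtime();
    clock_t c1 = clock();

    printf(" proc %d: cpu %f s, wall %f s\n",
           wrank, (float)(c1 - c0) / CLOCKS_PER_SEC, w1 - w0);

    MPI_Finalize();

    return 0;
}

If the CPU time stays flat while the wall time grows with the number of ranks, the slowdown comes from scheduling or throttling rather than from the loop itself.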

However, on my machine, which has 32 cores and 15 GiB of RAM,

mpirun -np 1 ./test 

gives

 proc 0 took 22.262777 seconds.

whereas

mpirun -np 20 ./test

gives

 proc 18 took 24.440767 seconds.
 proc 0 took 24.454365 seconds.
 proc 4 took 24.461191 seconds.
 proc 15 took 24.467632 seconds.
 proc 14 took 24.469728 seconds.
 proc 7 took 24.469809 seconds.
 proc 5 took 24.461639 seconds.
 proc 11 took 24.484224 seconds.
 proc 9 took 24.491638 seconds.
 proc 2 took 24.484953 seconds.
 proc 17 took 24.490984 seconds.
 proc 16 took 24.502146 seconds.
 proc 3 took 24.513380 seconds.
 proc 1 took 24.541555 seconds.
 proc 8 took 24.539808 seconds.
 proc 13 took 24.540005 seconds.
 proc 12 took 24.556068 seconds.
 proc 10 took 24.528328 seconds.
 proc 19 took 24.585297 seconds.
 proc 6 took 24.611254 seconds.

For other numbers of CPUs, the values fall somewhere in between.

htop also shows an increase in RAM consumption (VIRT is about 100M for 1 core and about 300M for 20 cores). Although that might just reflect the size of the MPI communicator?

Finally, the effect definitely scales with the problem size (so it is not a communication overhead adding a constant delay regardless of loop size). Indeed, lowering imax to 10 000 makes the walltimes similar.

1 core:

 proc 0 took 0.028439 seconds.

20 cores:

 proc 1 took 0.027880 seconds.
 proc 12 took 0.027880 seconds.
 proc 8 took 0.028024 seconds.
 proc 16 took 0.028135 seconds.
 proc 17 took 0.028094 seconds.
 proc 19 took 0.028098 seconds.
 proc 7 took 0.028265 seconds.
 proc 9 took 0.028051 seconds.
 proc 13 took 0.028259 seconds.
 proc 18 took 0.028274 seconds.
 proc 5 took 0.028087 seconds.
 proc 6 took 0.028032 seconds.
 proc 14 took 0.028385 seconds.
 proc 15 took 0.028429 seconds.
 proc 0 took 0.028379 seconds.
 proc 2 took 0.028367 seconds.
 proc 3 took 0.028291 seconds.
 proc 4 took 0.028419 seconds.
 proc 10 took 0.028419 seconds.
 proc 11 took 0.028404 seconds.

This has been tried on several machines with similar results. Maybe we are missing something very simple.

Thanks for your help!

【Comments】:

  • Maybe scheduling multiple cores/threads takes a considerable amount of time. Each thread needs to do more work than it takes the scheduler to schedule it. Also, if your code is memory-bandwidth bound, performance will drop.
  • Once again: C and C++ are two very different languages. Unless you have a good reason, you should not tag both! By the way, what does Fortran have to do with this?
  • @muXXmit2X I added several tags because this was tested in different languages with similar results. I should have mentioned that, though.
  • Have you considered the effect on the L3 cache? As more and more processors compete for a limited cache, the number of reads from RAM will increase.
  • Are you binding the MPI tasks to cores? Are you using more MPI tasks than cores? (a binding example follows right after this list)
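
Regarding the last comment: with Open MPI, binding and the resulting placement can be checked straight from the command line. The flags below are Open MPI's; other MPI launchers spell them differently.

mpirun --bind-to core --report-bindings -np 20 ./test

If ranks are not pinned, the kernel is free to migrate them between cores during the run, which can add noise to this kind of measurement.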

Tags: c++ c fortran mpi


【Solution 1】:

Processors have turbo frequencies that are limited by temperature.

Modern processors are limited by their thermal design power (TDP). While the processor is cool, a single core can speed up to its turbo multiplier. When it is hot, or when several non-idle cores are running, the cores slow down to their guaranteed base speed. The difference between base and turbo speeds is often around 400 MHz. AVX or FMA3 can slow a core down even further, below the base speed.
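
A rough way to confirm this on Linux (assuming the kernel exposes the current frequency in /proc/cpuinfo, which most distributions do) is to watch the effective core clocks while the benchmark runs; if they drop once all 20 ranks are busy, turbo scaling explains the slowdown:

watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"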

【Discussion】:

  • I find it unlikely that thermal protection kicking in is responsible for the reported slowdown; the processor only seems to run for about 25 seconds.
  • One way to check would be to disable turbo mode and run the test again (a sketch of how to do that on Linux follows below).
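
A sketch of that check on a Linux box using the intel_pstate driver (the sysfs path is driver-specific and assumed here; acpi-cpufreq exposes a different knob):

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
mpirun -np 1 ./test
mpirun -np 20 ./test
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

With turbo disabled, the single-rank and 20-rank timings should converge if turbo scaling is indeed the cause.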