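【Posted】: 2018-02-09 12:33:20
【Question】:
I have run into an interesting phenomenon that I cannot explain. I have not found an answer online, because most posts deal with weak scaling and communication overhead.
Here is a small piece of code that illustrates the problem. It was tested in several languages with similar results, hence the multiple tags.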
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;
    /* Line up all ranks before the timed region starts. */
    MPI_Barrier(MPI_COMM_WORLD);
    t = clock();  /* clock() reports processor time used by this process */

    int imax = 10000000;
    int jmax = 1000;
    /* Timed region: an empty nested loop, with no communication. */
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            // nothing
        }
    }

    t = clock() - t;
    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

    MPI_Finalize();
    return 0;
}
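As you can see, the only part being timed is the loop. So, with identical CPUs, no hyperthreading, and enough RAM, increasing the number of CPUs in use should give exactly the same time.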
However, on my machine, which has 32 cores and 15 GiB of RAM,
mpirun -np 1 ./test
gives
proc 0 took 22.262777 seconds.
whereas
mpirun -np 20 ./test
gives
proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.
For other numbers of CPUs, the timings fall between these two.
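htop also shows RAM consumption increasing (VIRT is about 100M for 1 core and about 300M for 20 cores), although that may be related to the size of the MPI communicator?
Finally, it is definitely related to the size of the problem (so it is not communication overhead, which would cause a constant delay regardless of the loop size). Indeed, lowering imax to 10 000 makes the walltimes similar.
1 core: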
proc 0 took 0.028439 seconds.
20 cores:
proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.
This has been tried on several machines with similar results. Maybe we are missing something very simple.
Thanks for your help!
【Comments】:
-
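Maybe scheduling multiple cores/threads takes a significant amount of time. Each thread needs to do more work than the scheduler spends scheduling it. Also, if your code is memory-bandwidth bound, performance will degrade.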
-
Once again: C and C++ are two very different languages. Unless you have a good reason, you should not tag both! By the way: what does fortran have to do with this?
-
@muXXmit2X I added several tags because this was tested in different languages with similar results. I should have mentioned that, though.
-
Have you considered the effect on the L3 cache? As more processors compete for a limited cache, the number of memory reads from RAM will increase.
-
Are you binding the MPI tasks to cores? Are you using more MPI tasks than cores?