【Posted】: 2015-11-17 22:18:15
【Problem description】:
I am trying to use OpenMP to parallelize a loop in which every iteration is independent (code sample below).
!$OMP PARALLEL DO DEFAULT(PRIVATE)
do i = 1, 16
    begin = omp_get_wtime()
    ! Each iteration allocates, fills, and frees its own private array
    allocate(array(100000000))
    do j = 1, 100000000
        array(j) = j
    end do
    deallocate(array)
    end = omp_get_wtime()
    write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$OMP END PARALLEL DO
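I would expect a linear speedup from this code, with each iteration taking as long as it does in the sequential version, since there are no possible race conditions or false sharing issues. However, I get the following results on a machine with 2 Xeon E5-2670 processors (8 cores each):
With only one thread: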
It 1 Thread 0 time 0.435683965682983
It 2 Thread 0 time 0.435048103332520
It 3 Thread 0 time 0.435137987136841
It 4 Thread 0 time 0.434695959091187
It 5 Thread 0 time 0.434970140457153
It 6 Thread 0 time 0.434894084930420
It 7 Thread 0 time 0.433521986007690
It 8 Thread 0 time 0.434685945510864
It 9 Thread 0 time 0.433223009109497
It 10 Thread 0 time 0.434834957122803
It 11 Thread 0 time 0.435106039047241
It 12 Thread 0 time 0.434649944305420
It 13 Thread 0 time 0.434831142425537
It 14 Thread 0 time 0.434768199920654
It 15 Thread 0 time 0.435182094573975
It 16 Thread 0 time 0.435090065002441
And with 16 threads:
It 1 Thread 0 time 1.14882898330688
It 3 Thread 2 time 1.19775915145874
It 4 Thread 3 time 1.24406099319458
It 14 Thread 13 time 1.28723978996277
It 8 Thread 7 time 1.39885497093201
It 10 Thread 9 time 1.46112895011902
It 6 Thread 5 time 1.50975203514099
It 11 Thread 10 time 1.63096308708191
It 16 Thread 15 time 1.69229602813721
It 7 Thread 6 time 1.74118590354919
It 9 Thread 8 time 1.78044819831848
It 15 Thread 14 time 1.82169485092163
It 12 Thread 11 time 1.86312794685364
It 2 Thread 1 time 1.90681600570679
It 5 Thread 4 time 1.96404480934143
It 13 Thread 12 time 2.00902700424194
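Any idea where the roughly 4x factor in the iteration times comes from?
I have tested with both the GNU compiler and the Intel compiler, using the O3 optimization flag.
【Discussion】: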