将 for 循环转换为并行计算答案

【问题标题】：Convert for-loops to parallel computing将 for 循环转换为并行计算
【发布时间】：2020-09-10 19:56:18
【问题描述】：

我做了一个数值模拟。我有 50,000 个粒子在一个盒子中结晶，其中有几个晶体（具有周期性边界条件）。我试着在盒子里找到最大的水晶。该代码运行良好，但在我的计算机上大约需要 100 [s]。速率确定步骤是下面的函数find_neighbors()。

有没有办法使用并行计算来缩短计算时间？

我的电脑有 12 个核心。也许我们可以将数组（中心）分成 12 部分？

我的函数的输出是一个二维列表。

这是find_neighbors() 函数，没有并行计算：

# Calculate if two particles are neighbors (Periodic Boundary Condition)
#              center is the coordinates of the particles
#              d      is the distance threshold for two clusters to be neighbors
#              Maxima_Range and
#              Lattice_Range are for periodic boundary conditions
#              Pos_3d is a list.
# Suppose particle i has neighbors j,k,l.
#              Pos_3d[i] = [j,k,l]

def find_neighbors(center,d,Maxima_Range,Lattice_Range):
    a,b = np.shape(center)          # get the dimension of the array
    flag_3d = np.zeros(a)           # count the number of neighbors for particle i
    Pos_3d = [[] for i in range(a)] # Pos_3d is a two dimensional list. 
    #                               # Pos_3d[i] includes the indices of all the neighbors of particle i

    diff_sqrt = d ** 0.5
    for i in range(a):
        if i % 6 == 5:              # for some reason, when i % 6 == 5,
            #                       #     it is the center coordinates,
            #                       #     this cannot be changed.
            for j in range(i+1,a):
                if j % 6 == 5:

                    diff_x = center[i][0]-center[j][0]
                    diff_y = center[i][1]-center[j][1]
                    diff_z = center[i][2]-center[j][2]

                    if (   diff_x <  diff_sqrt
                       and diff_y <  diff_sqrt
                       and diff_z <  diff_sqrt
                       and diff_x > -diff_sqrt
                       and diff_y > -diff_sqrt
                       and diff_z > -diff_sqrt ):
                           #        # if one of the x,y,z component is already
                           #        #    bigger than d, we don't need
                           #        #    to calculate their distance in 3-D.
                           #        #    They are not neighbors.

                        diff = ( diff_x * diff_x
                               + diff_y * diff_y
                               + diff_z * diff_z
                                 )

                        if diff <= d: # without period boundary condition
                            flag_3d[i] +=1
                            flag_3d[j] +=1
                            Pos_3d[i].append(j)
                            Pos_3d[j].append(i)
                            continue
                            
                       
                        # With periodic boundary condition
                        # in x

                        do something..

    return Pos_3d

【问题讨论】：

您好，您使用的似乎是numpy。您应该使用此模块的内置函数来计算矢量计算。如果还不够，哪部分代码可以并行执行？

标签： python performance parallel-processing processing-efficiency parallelism-amdahl

【解决方案1】：

问：“有没有使用并行计算的方法来缩短计算时间？”

是的，有。

然而，在任何人开始尝试之前，修复您的原始代码以避免任何和所有开销：

最好避免列表，附加但更喜欢 numpy 可向量化和高效处理数据结构：

aVectorOfDIFFs = center[i] - center[j] # aVectorOfDIFFs.shape  ( 3, ) # _x, _y, _z
if d >= np.dot( aVectorOfDIFFs,        # diff == ( ( _x * _x )
                aVectorOfDIFFs.T       #         + ( _y * _y )
                ):                     #         + ( _z * _z ) )
    # makes sense to do anything ... with ( i, j )
    ...

除了智能自上而下地将代码重构为性能最佳的 numpy-矢量化处理（如果对最终性能感兴趣，我们可以稍后再谈），只是原始 for-loops 的普通串联效率非常低，大部分时间都花在了 python 多年来知道这一点，缓慢的循环迭代。

它可以运行（原样）`~ 20x`更快：

>>> from zmq import Stopwatch; aClk = Stopwatch()

>>> def aPairOfNaiveFORs( a = 1E6 ):
...     intA = int( a )                        #           a gets fused into (int)
...     for            i in range(      intA ):# original, keep looping all i-s
...         if         i % 6 == 5:             #        if i-th "center":
...             for    j in range( i+1, intA ):#           keep looping all j-s
...                 if j % 6 == 5:             #           if j-th "center":
...                     # NOP,                 #              do some usefull work
...                     # not important here   #
...                     # for the testing, the overhead comparison matters
...                     continue               #               loop next
...     return


>>> aClk.start(); _ = aPairOfNaiveFORs( 1E3 ); aClk.stop()
   14770 [us] for 1E3 ~  13+ [us] per one item of a[:1000]
   14248
   13430
   13252

>>> aClk.start(); _ = aPairOfNaiveFORs( 1E5 ); aClk.stop()
64559866 [us] for 1E5 ~ 645+ [us] per one item of a[:10000]

现在，避免所有不符合 5 == <index> modulo 6 的荒谬循环

>>> def aPairOfLessNaiveFORs( a = 1E6 ):
...     intA = int( a )
...     for     i in range (  5, intA, 6 ):    # yes, start with the 5-th item of the first matching "center"
...         for j in range( i+6, intA, 6 ):    # yes, keep matching ( modulo 6 ) steps
...               continue

测试显示 `~ +13x ~ +17x` 更快的代码（仅避免了 5/6 个循环）：

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E3 ); aClk.stop()
    1171 [us] for 1E3 ~  1 [us] per one item of a[:1000]
     976
    1157
    1181
    1185

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E5 ); aClk.stop()
 3725167 [us] for 1E5 ~ 37 [us] per one item of a[:10000]

恢复：

正是这种避免主要开销的精确工程将缩短处理在大约 5E5 个结晶中心（即 3E6 坐标）某处缩放的循环。

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 5E5 ); aClk.stop()
 100104334 [us] for 5E5 ~ 17.6 x FASTER
>>> aClk.start(); _ =     aPairOfNaiveFORs( 5E5 ); aClk.stop()
1759420300 [us] for 5E5 ~ ^^^^^^^^^^^^^

即使it easily need not get any "positive"-boost 进入并行计算模型（记住corrected Amdahl's Law，这样做的附加开销不会被忽略或忽略）。

【讨论】：

它可以运行（原样）~ 20x更快：

测试显示 ~ +13x ~ +17x 更快的代码（仅避免了 5/6 个循环）：

恢复：

它可以运行（原样）`~ 20x`更快：

测试显示 `~ +13x ~ +17x` 更快的代码（仅避免了 5/6 个循环）：