如何并行计算成对距离矩阵？答案

【问题标题】：How to parallelize computation of pairwise distance matrix?如何并行计算成对距离矩阵？
【发布时间】：2017-06-29 15:00:44
【问题描述】：

我的问题大致如下。给定一个数字矩阵 X，其中每一行是一个项目。我想根据除自身之外的所有行中的 L2 距离找到每一行的最近邻居。我尝试阅读官方文档，但对于如何实现这一点仍然有些困惑。有人可以给我一些提示吗？

我的代码如下

function l2_dist(v1, v2)
    return sqrt(sum((v1 - v2) .^ 2))
end

function main(Mat, dist_fun)
    n = size(Mat, 1)

    Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
    Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
    @parallel for i = 1:n
        Dist[i] = Inf
        Id[i] = 0
    end
    Threads.@threads for i in 1:n
        for j in 1:n
            if i != j
                println(i, j)
                dist_temp = dist_fun(Mat[i, :], Mat[j, :])
                if dist_temp < Dist[i]
                    println("Dist updated!")
                    Dist[i] = dist_temp
                    Id[i] = j
                end
            end
        end
    end
    return Dict("Dist" => Dist, "Id" => Id)
end

n = 4000
p = 30

X = [rand() for i in 1:n, j in 1:p];


main(X[1:30, :], l2_dist)
@time N = main(X, l2_dist)

我正在尝试将所有 i （即计算每行最小值）分布在不同的核心上。但是上面的版本显然不能正常工作。它甚至比顺序版本还要慢。有人可以指出我正确的方向吗？谢谢。

【问题讨论】：

我建议在以任何其他方式优化之前将该字典从内部循环中取出。字典对运行时有害。如果你真的想要一个字典，可以在没有字典的情况下创建数组，然后在循环后将它们添加到字典中。这可能会有所帮助。
@ChrisRackauckas 谢谢。我想我可以只创建两个数组，并且只在最后一步使它们成为字典。你对并行化部分有什么提示吗？
“您对并行化部分有任何提示吗？”，除了使用Threads.@threads 对其进行多线程处理，然后通过每个线程使用不同的数组并随后合并使其成为线程安全？不，应该很标准。如果我有时间，我可以把它写下来。
@ChrisRackauckas 非常感谢。我刚刚尝试按照您对字典的建议修改代码（现在在问题中更新）。只是这样做给了我 20 倍的加速。太棒了！
是否打算使用dist_fun(Mat[i, ], Mat[j, ]) 而不是dist_fun(Mat[i,:], Mat[j,:])？前者要快得多，但后者实际上给出了正确的答案。在表达式之前添加@views 可以提高速度/内存。

标签： parallel-processing julia distance-matrix

【解决方案1】：

也许除了你写下的内容之外，你还在做一些事情，但是，从我所见，你实际上并没有并行进行任何计算。 Julia 要求您告诉它您希望它访问多少个处理器（或线程）。您可以通过以下任一方式完成此操作

使用多个处理器启动 Julia julia -p #（其中 # 是您希望 Julia 有权访问的处理器数量）
启动 Julia“会话”后，您可以调用 addprocs 函数来添加额外的处理器。
要拥有多个线程，您需要运行命令export JULIA_NUM_THREADS = #。我不太了解线程，所以我将坚持使用@parallel 宏。我建议阅读documentation 以了解有关线程的更多详细信息——也许@Chris Rackauckas 可以进一步扩大差异。

下面是关于我的代码和你的代码的一些 cmets：

我使用的是0.6.1-pre.0 版本。我不认为我在做任何特定于 0.6 的事情，但这是为了以防万一。
在计算向量之间的距离时，我将使用Distances.jl 包。我认为将尽可能多的计算分包给编写良好且维护良好的包是一个好习惯。
我将计算列之间的距离，而不是计算行之间的距离。这是因为 Julia 是一种以列为主的语言，所以这会增加缓存命中的数量并提供一点额外的速度。显然，您只需转置输入即可获得所需的逐行结果。
除非您期望有那么多内存分配，否则那么多分配表明您的代码中的某些内容效率低下。这通常是类型稳定性问题。我不知道您之前的代码中是否存在这种情况，但在当前版本中这似乎不是问题（我并不清楚为什么您有这么多分配）。

代码如下

# Make sure all processors have access to Distances package
@everywhere using Distances

# Create a random matrix
nrow = 30
ncol = 4000

# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)

function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
    # Get size of the matrix
    nrow, ncol = size(X)

    # Create `SharedArray` to store output
    ind_vec = SharedArray{Int}(ncol)
    dist_vec = SharedArray{Float64}(ncol)

    # Compute the distance between columns
    @sync @parallel for i in 1:ncol
        # Initialize various temporary variables
        min_dist_i = Inf
        min_ind_i = -1
        X_i = view(X, :, i)

        # Check distance against all other columns
        for j in 1:ncol
            # Skip comparison with itself
            if i==j
                continue
            end

            # Tell us who is doing the work 
            # (can uncomment if you want to verify stuff)
            # println("Column $i compared with Column $j by worker $(myid())")

            # Evaluate the new distance... 
            # If it is less then replace it, otherwise proceed
            dist_temp = evaluate(M, X_i, view(X, :, j))
            if dist_temp < min_dist_i
                min_dist_i = dist_temp
                min_ind_i = j
            end
        end

        # Which column is minimum distance from column i
        dist_vec[i] = min_dist_i
        ind_vec[i] = min_ind_i
    end

    return dist_vec, ind_vec
end

# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)

@time main(X, metric);
@show dist[[1, 5, 25]], inds[[1, 5, 25]]

你可以运行代码

1 个处理器 julia testfile.jl

% julia testfile.jl
  0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
  (dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

n 个处理器（在本例中为 4 个）julia -p n testfile.jl

% julia -p 4 testfile.jl
  0.201523 seconds (2.10 k allocations: 99.107 KiB)
  (dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

【讨论】：

非常感谢您的回答！这正是我想要的！