Speeding up distance matrix computation with Numpy and Cython

【Question title】: Speeding up distance matrix computation with Numpy and Cython
【Posted】: 2014-10-02 12:43:03
【Question description】:

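Consider a numpy array A of dimensionality NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way of doing it? This is not exactly the problem I need to solve, but it is a good example of what I am trying to do (in general, other distance metrics could be used).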

This is the fastest I could come up with so far:

n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))

But is this the fastest way? I am mainly concerned about the for loop. Can we beat it with Cython?

To avoid the loop, I tried broadcasting, doing something like this:

D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))

But that turned out to be a bad idea: constructing the intermediate 3D array of dimensionality NxNxM adds overhead, so the performance is worse.
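One way to stay vectorized without materializing the NxNxM intermediate is to expand the squared distance as |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, so that only NxN arrays are ever created. This is a different technique from the two shown in the question, it applies only to the Euclidean metric, and it can lose precision when two rows are nearly identical. The name `dist_gram` is made up for this sketch.

```python
import numpy as np

def dist_gram(A):
    """Euclidean distance matrix via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b.

    Hypothetical helper, not part of the original question.
    """
    A = np.asarray(A, dtype=np.float64)
    # squared norm of each row, shape (N,)
    sq_norms = np.einsum("ij,ij->i", A, A)
    # N x N matrix of squared distances; the largest temporary is N x N
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * A.dot(A.T)
    # rounding can leave tiny negative values, so clamp before the square root
    np.maximum(sq_dists, 0.0, out=sq_dists)
    D = np.sqrt(sq_dists)
    # the distance from a row to itself is exactly zero
    np.fill_diagonal(D, 0.0)
    return D
```

Most of the work here happens in the matrix product `A.dot(A.T)`, which NumPy typically hands to an optimized BLAS routine, so it may be worth timing against the row loop on your own data.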

I tried Cython, but I am new to it, so I do not know how good my attempt is:

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]    
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n,n), dtype=np.float64)      
    cdef int i = 0    
    for i in range(n):  
        dm[i] = np.sqrt(np.square(A-A[i]).sum(1)).astype(np.float64)              
    return dm

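The above code was a bit slower than the Python for loop. I do not know much about Cython, but I assumed I could achieve at least the same performance as for loop + numpy. Is it possible to get a noticeable performance improvement when it is done the right way? Or is there some other way to speed this up (not involving parallel computation)?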

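【Question discussion】:

How big are N and M? Doing N loops in Python instead of in NumPy will of course slow you down, but it is not nearly as bad as doing NxM loops. Is it actually too slow, or are you just optimizing for the sake of it?
Also, for this it might be easier to write a ufunc in Cython and then blast it over A, rather than putting the whole loop in Cython. There is less to get wrong that way, if nothing else…
There's a SciPy method specifically for performing this task, so that would probably be a pretty fast option.
@user2357112, yes, I just tried scipy and it is really fast, thanks. But I still need to figure out how to implement this myself, since it is only an example of a more general problem I have.
For Cython, if you are using it, you probably want to do the math yourself instead of calling NumPy routines. NumPy vectorization does not help much when you are already writing code that compiles to C.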
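The comment above does not say which SciPy routine is meant, and the link did not survive. Assuming it refers to the `scipy.spatial.distance` module, a minimal sketch using `pdist` and `squareform` might look like this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.randn(100, 200)

# pdist returns the condensed form: one entry per pair of rows i < j
condensed = pdist(A, metric="euclidean")
# squareform expands it into the full symmetric N x N matrix
D = squareform(condensed)

# other metrics are selected by name, e.g. city-block (L1) distance
D_l1 = squareform(pdist(A, metric="cityblock"))
```

Because `pdist` takes the metric as a parameter, it may also cover some of the "other distance metrics" case mentioned in the question.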

Tags: python performance optimization numpy cython

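【Solution 1】:

The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.

Here's a quick stab at a more optimized Cython version of your distance function: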

import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric

    return D
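For checking a compiled version on small inputs, the same upper-triangle loop can be written in plain Python. This is only a reference sketch (the name `dist_reference` is made up here) and it will be very slow on anything but tiny arrays:

```python
import math
import numpy as np

def dist_reference(A):
    """Pure-Python mirror of the triple loop; hypothetical name, for testing only."""
    nrow, ncol = A.shape
    D = np.zeros((nrow, nrow), dtype=np.float64)
    for ii in range(nrow):
        # only visit pairs above the diagonal, then mirror the result
        for jj in range(ii + 1, nrow):
            tmpss = 0.0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            D[ii, jj] = D[jj, ii] = math.sqrt(tmpss)
    return D

A_small = np.random.randn(5, 3)
D_small = dist_reference(A_small)
```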

I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:

import pyximport
pyximport.install()
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print(np.allclose(D1, D2))
# True

So it works, at least. Let's do some benchmarking using the %timeit magic:

%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop

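A ~9x speedup is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory required to construct the intermediate array.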

A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop

I wouldn't recommend trying that using broadcasting...
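To put a number on the memory problem, the size of the intermediate array that broadcasting would have to create for A2 can be estimated from the shapes alone, assuming float64 elements of 8 bytes each:

```python
n, m = 1000, 2000      # shape of A2 above
itemsize = 8           # bytes per float64 element

intermediate_bytes = n * n * m * itemsize   # the N x N x M difference array
result_bytes = n * n * itemsize             # the final N x N distance matrix

print(intermediate_bytes / 1e9, "GB for the intermediate")   # 16.0 GB for the intermediate
print(result_bytes / 1e6, "MB for the result")               # 8.0 MB for the result
```

In practice the peak may be higher still, since squaring the differences can allocate a second temporary of the same size.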

Another thing we could do is parallelize over the outermost loop using the prange function:

from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
...

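In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this: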

$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
   -Wall -fno-strict-aliasing  -I/usr/include/python2.7 -o fastdist.so fastdist.c
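An alternative to invoking gcc by hand is a small build script. The following `setup.py` is a sketch that is not taken from the original answer. It assumes setuptools and Cython are installed and that the compiler accepts the gcc-style `-fopenmp` flag; other compilers use different flags.

```python
# setup.py (sketch) -- build with:  python setup.py build_ext --inplace
import numpy as np
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "fastdist",
    sources=["fastdist.pyx"],
    # the NumPy headers are needed because the module uses `cimport numpy`
    include_dirs=[np.get_include()],
    # gcc-style flags (assumption): enable optimization and OpenMP
    extra_compile_args=["-O3", "-fopenmp"],
    extra_link_args=["-fopenmp"],
)

setup(name="fastdist", ext_modules=cythonize([ext]))
```

After building in place, `import fastdist` should pick up the compiled extension without going through pyximport.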

With parallelism, using 8 threads:

%timeit D2 = fastdist.dist_parallel(A2)
# 1 loops, best of 3: 509 ms per loop

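【Discussion】:

Thanks a lot! This looks promising!
@ojy Glad you found it helpful. I just realized that my initial version was rather inefficient, because it looped over every element of D rather than just the upper triangle. The updated single-threaded version is about twice as fast again.
Yes, I noticed, but I haven't had a chance to try it yet, can't wait till Monday :) Thanks a lot!
Finally tried it! Works great! I looked a bit more into how to improve the parallelization further, since it only gave about a 3x speedup, even on the 24 cores I have access to. I found a very helpful answer from Saullo Castro on this question: *.com/questions/19002486/… The idea is to have a separate routine that is called in parallel and is passed only pointers to the data arrays. It gave me an additional 5x speedup.
@ojy I'm a bit surprised that it made such a big performance difference for you, although I suppose it may depend on many other factors, including your compiler. Another thing that sometimes helps is to declare the distance function as inline (e.g. cdef inline void mydist(...) nogil:), which gives the C compiler an extra hint to optimize that function (usually by substituting the function's code into its caller). You could also try varying the number of OpenMP threads being used - 24 is probably overkill unless A is really big.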