【Question title】: Pythonic way to calculate distance using numpy matrices?
【Posted】: 2016-06-27 14:06:42
【Question】:

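I have a list of points in a numpy matrix,

A = [[x11,x12,x13],[x21,x22,x23]]

and I have an origin point o = [o1,o2,o3] from which I have to calculate the distance to each point.

A - o subtracts o from every point. Currently I then have to square each attribute and add them up, which I do in a for loop. Is there a more intuitive way to do this?

P.S.: I am doing the above calculation as part of a k-means clustering application. I have already computed the centroids, and now I have to compute the distance of each point from the centroids.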

input_mat = input_data_per_minute.values[:,2:5]

scaled_input_mat = scale2(input_mat)

k_means = cluster.KMeans(n_clusters=5)

print('training start')
k_means.fit(scaled_input_mat)
print('training over')

out = k_means.cluster_centers_

I have to calculate the distance between input_mat and each cluster centroid.

【Comments】:

Take a look at cdist from scipy.

Tags: python numpy

【Solution 1】:

You should be able to do something like this (assuming I read your question right ;)):

In [1]: import numpy as np

In [2]: a = np.array([[11,12,13],[21,22,23]])

In [3]: o = [1,2,3]

In [4]: a - o  # just showing
Out[4]: 
array([[10, 10, 10],
       [20, 20, 20]])

In [5]: a ** 2  # just showing
Out[5]: 
array([[121, 144, 169],
       [441, 484, 529]])

In [6]: b = (a ** 2) + (a - o)

In [7]: b
Out[7]: 
array([[131, 154, 179],
       [461, 504, 549]])

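Numpy is great because it operates on arrays element-wise! This means that 90+% of the time you can work through an array without writing a for loop. Looping over an array with an explicit Python for loop is also noticeably slower.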
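Note that the expression in In [6] adds the squared values to the differences, which is not a Euclidean distance. If the goal is the distance from each row of a to o, a minimal sketch could look like the following. It reuses a and o from the session above; the names diff, dist and dist_norm are only illustrative.

```python
import numpy as np

a = np.array([[11, 12, 13], [21, 22, 23]])
o = np.array([1, 2, 3])

# Broadcasting subtracts o from every row of a.
diff = a - o

# Square each component, sum over the coordinate axis, take the square root.
dist = np.sqrt((diff ** 2).sum(axis=1))

# np.linalg.norm computes the same row-wise Euclidean norm in one call.
dist_norm = np.linalg.norm(diff, axis=1)
```

Both dist and dist_norm hold one distance per row of a, with no explicit for loop.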

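【Discussion】:

【Solution 2】:

Numpy solution:

Numpy is great at broadcasting, so you can trick it into computing all the distances in one step. Depending on the number of points and cluster centers, though, this can consume a lot of memory, because it creates an intermediate array of shape (number_of_points, number_of_cluster_centers, 3).

First you need to know a little about broadcasting. I'll spell it out and define each dimension by hand.

I'll start by defining some points and centers for illustration: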

import numpy as np

points = np.array([[1,1,1],
                   [2,1,1],
                   [1,2,1],
                   [5,5,5]])

centers = np.array([[1.5, 1.5, 1],
                    [5,5,5]])

Now I'll prepare these arrays so that numpy broadcasting gives me the difference along each coordinate:

distance_3d = points[:,None,:] - centers[None,:,:]

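In effect, the first dimension is now the point "label", the second dimension is the center "label", and the third dimension is the coordinate. The subtraction gives the difference along each coordinate. The result has the shape: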

(number_of_points, number_of_cluster_centers, 3)
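To make the broadcasting step concrete, here is a small sketch that prints nothing but records the intermediate shapes for the test data defined above. The names points_3d, centers_3d and shapes are introduced only for this illustration.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])

centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Indexing with None inserts a new axis of length 1.
points_3d = points[:, None, :]     # shape (4, 1, 3)
centers_3d = centers[None, :, :]   # shape (1, 2, 3)

# Axes of length 1 are stretched to match the other operand.
distance_3d = points_3d - centers_3d   # shape (4, 2, 3)

shapes = (points_3d.shape, centers_3d.shape, distance_3d.shape)
```

Here distance_3d[i, j] is the coordinate-wise difference between point i and center j.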

Now it is just a matter of applying the Euclidean distance formula:

# Square each distance
distance_3d_squared = distance_3d ** 2

# Take the sum of each coordinates distance (the result will be 2D)
distance_sum = np.sum(distance_3d_squared, axis=2)

# And take the square root
distance = np.sqrt(distance_sum)

For my test data the final result is:

#array([[ 0.70710678,  6.92820323],
#       [ 0.70710678,  6.40312424],
#       [ 0.70710678,  6.40312424],
#       [ 6.36396103,  0.        ]])

So the element distance[i, j] gives you the distance from point i to center j.
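Since the question comes from a k-means setting, one natural use of this matrix is to look up the closest center for each point. The following is only a sketch on the test data above; nearest_center and nearest_distance are illustrative names, not part of the original answer.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])

centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Distance matrix of shape (number_of_points, number_of_cluster_centers).
distance = np.sqrt(np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2))

# Index of the closest center for every point (minimum along each row).
nearest_center = np.argmin(distance, axis=1)

# Distance from every point to its closest center.
nearest_distance = distance.min(axis=1)
```

For the test data the first three points are closest to center 0 and the last point coincides with center 1.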

Summary:

You can put all of this in one line:

distance2 = np.sqrt(np.sum((points[:,None,:] - centers[None,:,:]) ** 2, axis=2))

Scipy solution (faster and shorter):

Or, if you have scipy, use cdist:

from scipy.spatial.distance import cdist
distance3 = cdist(points, centers)

The result will be the same either way, but cdist is the fastest option when there are many points and centers.
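If scipy is not available and the full (number_of_points, number_of_cluster_centers, 3) intermediate array is too large, one possible workaround is to broadcast over a slice of the points at a time. This is a rough sketch, not something from the original answer: the function chunked_distances and its chunk_size parameter are hypothetical names chosen for illustration.

```python
import numpy as np

def chunked_distances(points, centers, chunk_size=1000):
    """Return the (len(points), len(centers)) Euclidean distance matrix,
    broadcasting over at most chunk_size points at a time."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    out = np.empty((len(points), len(centers)))
    for start in range(0, len(points), chunk_size):
        block = points[start:start + chunk_size]
        # Intermediate array is only (len(block), len(centers), n_coordinates).
        diff = block[:, None, :] - centers[None, :, :]
        out[start:start + chunk_size] = np.sqrt((diff ** 2).sum(axis=2))
    return out

points = np.array([[1, 1, 1], [2, 1, 1], [1, 2, 1], [5, 5, 5]])
centers = np.array([[1.5, 1.5, 1], [5, 5, 5]])

# A tiny chunk size just to exercise the loop on the test data.
distance_chunked = chunked_distances(points, centers, chunk_size=2)

# Reference result from the one-line broadcasting version.
distance_full = np.sqrt(np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2))
```

The loop runs only once per chunk, so the per-element work still happens inside numpy while peak memory is bounded by the chunk size.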

【Discussion】: