Creating a Distance Matrix? Answers

【Question title】: Creating a Distance Matrix?
【Posted】: 2015-06-11 10:53:01
【Question】:

I am currently reading data into a data frame that looks like this:

City         XCord    YCord   
Boston         5        2
Phoenix        7        3
New York       8        1
.....          .        .

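From this data I want to create a Euclidean distance matrix that shows the distance between every pair of cities, so that I get a result matrix like this: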

             Boston    Phoenix   New York
Boston         0        2.236      3.162
Phoenix        2.236      0        2.236
New York       3.162    2.236        0

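My actual data frame has many more cities and coordinates, so I need some way to iterate over all pairs of cities and build a distance matrix like the one shown above. I don't know how to pair up all the entries and apply the Euclidean distance formula. Any help would be appreciated.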

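【Question comments】:

Do you have any code? Please provide at least the code that reads these coordinates into memory, so that you end up with something like cords[boston] = (5, 2)
Right now I am reading the CSV file like this: Data = pd.read_csv('C:\Users\Jerry\Desktop\cities.csv')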

Tags: python numpy dataframe

【Solution 1】:

I think you are looking for distance_matrix.

For example:

Create the data:

import pandas as pd
from scipy.spatial import distance_matrix

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

Output:

          xcord ycord
Boston      5   7
Phoenix     7   3
New York    8   1

Using the distance_matrix function:

 pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)

Result:

          Boston    Phoenix     New York
Boston    0.000000  4.472136    6.708204
Phoenix   4.472136  0.000000    2.236068
New York  6.708204  2.236068    0.000000
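
Note that the data above places Boston at (5, 7), while the question has it at (5, 2). As a rough sketch, assuming the question's coordinates and its column names XCord and YCord, the same call should reproduce the matrix the question asks for:

```python
import pandas as pd
from scipy.spatial import distance_matrix

# Coordinates exactly as listed in the question (Boston is (5, 2) there).
df = pd.DataFrame(
    {"XCord": [5, 7, 8], "YCord": [2, 3, 1]},
    index=["Boston", "Phoenix", "New York"],
)

# (n, 2) array of coordinates; the city names stay in the index.
coords = df[["XCord", "YCord"]].to_numpy()

# Pairwise Euclidean distances, labelled with city names on both axes.
dist_df = pd.DataFrame(
    distance_matrix(coords, coords), index=df.index, columns=df.index
)
print(dist_df.round(3))
```

If the data comes from a CSV with a City column, setting that column as the index first (for example with set_index) would give the same labelled layout.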

【Comments】:

【Solution 2】:

If you don't want to use scipy, you can use a list comprehension like this:

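import numpy as np
# xy_list: a sequence of NumPy coordinate arrays, e.g. the rows of df.values
dist = lambda p1, p2: np.sqrt(((p1 - p2) ** 2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])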
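
The snippet does not say where xy_list comes from. A self-contained sketch, assuming xy_list is simply a list of NumPy arrays holding the question's coordinates (the name and the data layout are assumptions for illustration):

```python
import numpy as np

# Hypothetical input: one coordinate array per city, in the order
# Boston, Phoenix, New York (values taken from the question).
xy_list = [np.array([5, 2]), np.array([7, 3]), np.array([8, 1])]

def dist(p1, p2):
    """Euclidean distance between two coordinate arrays."""
    return np.sqrt(((p1 - p2) ** 2).sum())

# Row i, column j holds the distance between city i and city j.
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])
print(dm.round(3))
```

This evaluates dist n * n times in Python, so it is likely fine for a handful of cities but slow for large inputs.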

【Comments】:

【Solution 3】:

I'll give a method in pure Python.

Import the sqrt function from the math module:

from math import sqrt

Assume your coordinates are stored in a cords dictionary like this:

cords['Boston'] = (5, 2)

Define a function that computes the Euclidean distance between two given 2D points:

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1])

Initialize the resulting matrix as a dictionary and fill it in for every pair of cities:

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)

D is the resulting matrix.

The full source and the printed result follow:

from math import sqrt

cords = {}
cords['Boston'] = (5, 2)
cords['Phoenix'] = (7, 3)
cords['New York'] = (8, 1)

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1]) 

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)   

for city1, v in D.items():
    for city2, d in v.items():
        print(city1, city2, d)

Result (from the original Python 2 run; Python 3 prints more digits and iterates in insertion order):

Boston Boston 0.0
Boston New York 3.16227766017
Boston Phoenix 2.2360679775
New York Boston 3.16227766017
New York New York 0.0
New York Phoenix 2.2360679775
Phoenix Boston 2.2360679775
Phoenix New York 2.2360679775
Phoenix Phoenix 0.0
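
On Python 3.8 or later, the hand-written dist function could probably be replaced by math.dist from the standard library. A minimal sketch, assuming the same cords dictionary as above and building D with a nested dict comprehension instead of explicit loops:

```python
from math import dist  # available since Python 3.8

cords = {
    'Boston': (5, 2),
    'Phoenix': (7, 3),
    'New York': (8, 1),
}

# Nested dictionary: D[city1][city2] is the Euclidean distance between them.
D = {
    city1: {city2: dist(p1, p2) for city2, p2 in cords.items()}
    for city1, p1 in cords.items()
}

for city1, row in D.items():
    for city2, d in row.items():
        print(city1, city2, round(d, 3))
```

math.dist accepts points of any dimension, so the same code should also work if the coordinates had more than two components.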

【Comments】:

【Solution 4】:

import numpy as np
import pandas as pd

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

# Coordinates as an (n, 2) NumPy array
n_df = df.values
n = n_df.shape[0]

# n x n result matrix, filled in pair by pair
matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        # Euclidean distance between city i and city j
        matrix[i, j] = np.sqrt(np.sum((n_df[i] - n_df[j]) ** 2))

print(matrix)
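
The double loop can also be written with NumPy broadcasting. This is a sketch of an alternative rather than part of the answer above; the names pts, diff, matrix_vec and matrix_loop are made up for illustration:

```python
import numpy as np

# Same coordinates as in the answer above, as an (n, 2) float array.
pts = np.array([[5, 7], [7, 3], [8, 1]], dtype=float)

# Broadcasting: diff[i, j] = pts[i] - pts[j], giving shape (n, n, 2).
diff = pts[:, None, :] - pts[None, :, :]
matrix_vec = np.sqrt((diff ** 2).sum(axis=-1))

# The explicit double loop, kept here only to check the result.
n = len(pts)
matrix_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        matrix_loop[i, j] = np.sqrt(np.sum((pts[i] - pts[j]) ** 2))

assert np.allclose(matrix_vec, matrix_loop)
print(matrix_vec)
```

The broadcast version builds an intermediate (n, n, 2) array, so it trades memory for speed; for very large n that intermediate may become the limiting factor.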

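【Comments】:

Could you describe for the OP why this approach improves on, or offers a good alternative to, the already good answers to this question?
What is a minimal reproducible example? *.com/help/minimal-reproducible-example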

【Solution 5】:

import pandas as pd
import numpy as np
from scipy.spatial import distance_matrix

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
x, y = df.xcord.to_numpy(), df.ycord.to_numpy()
x_y = df.values

%%timeit
pd.DataFrame(
    np.hypot(
        np.subtract.outer(x, x),
        np.subtract.outer(y, y)
    ),
    index=df.index, columns=df.index
)
# 32.9 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
pd.DataFrame(distance_matrix(x_y, x_y), index=df.index, columns=df.index)
# 49.8 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Also, compared with a plain hand-written sqrt approach, hypot is more robust against overflow and underflow.

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

【Comments】:

【Solution 6】:

SciPy has a function for this: scipy.spatial.distance.cdist()
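
The answer only names the function. A minimal sketch of how it might be used, assuming the same sample DataFrame as in the earlier answers (metric='euclidean' is cdist's default and is spelled out here only for clarity):

```python
import pandas as pd
from scipy.spatial.distance import cdist

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

# cdist computes the distance between every row of the first array
# and every row of the second; passing the same array twice gives
# the full city-by-city matrix.
d = cdist(df.values, df.values, metric='euclidean')

dist_df = pd.DataFrame(d, index=df.index, columns=df.index)
print(dist_df)
```

Because cdist takes a metric argument, the same call can be switched to other distances (for example 'cityblock') without changing the rest of the code.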

【Comments】: