选项 1:Scikit-learn NearestNeighbors 类
要找到 k 近邻 (kNN),sklearn.neighbors.NearestNeighbors 就可以达到目的。
数据
import numpy as np
import pandas as pd
np.random.seed(52) # reproducibility
df = pd.DataFrame(np.random.rand(4, 6))
print(df)
0 1 2 3 4 5
0 0.823110 0.026118 0.210771 0.618422 0.098284 0.620131
1 0.053890 0.960654 0.980429 0.521128 0.636553 0.764757
2 0.764955 0.417686 0.768805 0.423202 0.926104 0.681926
3 0.368456 0.858910 0.380496 0.094954 0.324891 0.415112
代码
from sklearn.neighbors import NearestNeighbors
k = 3
dist, indices = NearestNeighbors(n_neighbors=k).fit(df).kneighbors(df)
结果
print(dist)
array([[0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
[0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
[0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
[2.10734243e-08, 9.32862532e-01, 1.02130721e+00]])
print(indices)
array([[0, 2, 3],
[1, 3, 2],
[2, 1, 3],
[3, 1, 2]])
获得的距离和索引可以很容易地重新排列。
选项 2:手动计算(除了自己最近)
sklearn.metrics 有一个内置的欧式距离函数,它输出一个形状为[#rows x #rows] 的数组。您可以从min() 和argmin() 中排除对角线元素(到自身的距离,即0),方法是用无穷大填充。
代码
from sklearn.metrics import euclidean_distances
dist = euclidean_distances(df.values, df.values)
np.fill_diagonal(dist, np.inf) # exclude self from min()
df_want = pd.DataFrame({
"point": range(df.shape[0]),
"closest_point": dist.argmin(axis=1),
"distance": dist.min(axis=1)
})
结果
print(df_want)
point closest_point distance
0 0 2 1.093309
1 1 3 0.932863
2 2 1 0.972370
3 3 1 0.932863