What is the fastest way to generate a distance matrix between locations given by latitude and longitude? Answers

【Question title】: What is the fastest way to generate a matrix for distances between locations with lat and lon?
【Posted】: 2022-01-18 01:34:49
【Question description】:

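Thank you for reading this. I currently have latitude and longitude values for a large number of places, and I need to create a distance matrix for the places that are within 10 km of each other. (It is fine to fill the matrix with 0 for places that are more than 10 km apart.)

The data looks like this: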

place_coordinates=[[lat1, lon1],[lat2, lon2],...]

In this case, I use the code below to calculate it, but it takes a very long time.

place_correlation = pd.DataFrame(
   squareform(pdist(place_coordinates, metric=haversine)),
   index=place_coordinates,
   columns=place_coordinates
)

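When using squareform, I do not know how to avoid both computing and storing the distances for pairs that are more than 10 km apart.

What is the fastest way to do this?

Thank you in advance!

【Comments on the question】:

There are many ways to compute the distance between latitude/longitude coordinates. Which metric and technique do you want to use?
Check my answer. If you need a better one, please provide a minimal working example and state your requirements (e.g. how long is the input array? Do you need the haversine metric, or could another one suit your case?).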

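Tags: python pandas dataframe scipy pdist

【Solution 1】:

First, do you need the haversine metric for the distance calculation, and which implementation are you using? If you used, for example, the euclidean metric, the calculation would be faster, but I assume you have a good reason for choosing haversine.

In that case, it may be better to use a more optimized implementation of haversine (though I do not know which implementation you use). See, for example, this SO question.

I guess you are using pdist and squareform from scipy.spatial.distance. When you look at the implementation behind them (here), you will see that for a metric passed as a Python function, the function is called once per pair inside a for loop. In that case, you can use a vectorized implementation instead (e.g. this one from the question linked above).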

import numpy as np
import pandas as pd  # needed for the DataFrame built at the end
import itertools
from scipy.spatial.distance import pdist, squareform
from haversine import haversine  # pip install haversine

# original approach
place_coordinates = [(x, y) for x in range(10) for y in range(10)]
d = pdist(place_coordinates, metric=haversine)

# approach using combinations
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
d2 = [haversine(x, y) for (x, y) in place_coordinates_comb]

# just ensure that using combinations gives the same results as using pdist
np.testing.assert_array_equal(d, d2)

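# vectorized version (adapted from the link above)
# NOTE: haversine_np is not defined in this answer; the definition below is an
# assumed sketch of the usual NumPy formulation, using the mean Earth radius of
# the haversine package (6371.0088 km)
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0088 * np.arcsin(np.sqrt(a))

# 1) create the combinations and flatten each pair to (lat1, lon1, lat2, lon2);
#    the haversine package takes (lat, lon) points, while haversine_np takes
#    (lon1, lat1, lon2, lat2) as arguments
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
place_coordinates_comb_flatten = [(*x, *y) for (x, y) in place_coordinates_comb]
# 2) unpack the columns in the order in which they were flattened
lat1, lon1, lat2, lon2 = np.array(place_coordinates_comb_flatten).T
# 3) vectorized computation
d_vect = haversine_np(lon1, lat1, lon2, lat2)

# the results can differ slightly from the haversine package (floating-point
# rounding, or a different Earth radius in the linked implementation), so compare
# with a tolerance instead of requiring exact equality
np.testing.assert_allclose(d, d_vect, rtol=1e-3)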

When you compare the timings (absolute numbers will vary with the machine used):

%timeit pdist(place_coordinates, metric=haversine)
# 15.7 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit haversine_np(lon1, lat1, lon2, lat2)
# 241 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

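That is quite a lot (roughly 60 times faster). When your array is very long (how many coordinates do you have?), this will help you considerably.

Finally, you can combine it with your code: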

place_correlation = pd.DataFrame(squareform(d_vect), index=place_coordinates, columns=place_coordinates)
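
As an alternative to building the pairs with itertools, the square matrix can also be computed directly with NumPy broadcasting, and the entries beyond 10 km set to 0 as the question allows. The following is only a sketch under stated assumptions: the names `haversine_matrix_km`, `max_km` and `EARTH_RADIUS_KM` are introduced here for illustration and are not part of the original answer, and the input is assumed to be (lat, lon) pairs in degrees.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in km


def haversine_matrix_km(coords, max_km=None):
    """Return the N x N haversine distance matrix for (lat, lon) pairs in degrees.

    If max_km is given, distances larger than max_km are replaced by 0.
    """
    lat, lon = np.radians(np.asarray(coords, dtype=float)).T
    # Broadcasting a column against a row yields all pairwise differences.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a marginally outside [0, 1].
    dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    if max_km is not None:
        dist[dist > max_km] = 0.0
    return dist


coords = [(48.0, 11.0), (48.05, 11.05), (49.0, 11.0)]
matrix = haversine_matrix_km(coords, max_km=10.0)
print(matrix)
```

This still computes every pair and allocates several N x N arrays, so memory grows quadratically with the number of places. It removes the Python-level loop but does not by itself avoid the work for distant pairs.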

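An additional improvement could be to use another, cheaper metric (e.g. euclidean, which is faster) to quickly determine which pairs are more than 10 km apart, and then compute haversine only for the remaining pairs.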
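
One way to sketch such a pre-filter, using only the standard library: two points that are at most 10 km apart cannot differ in latitude by more than 10 km expressed as an angle, so after sorting by latitude the inner loop can stop early. Note that this swaps the euclidean check for a latitude bound, which is a different but related cheap test. The names `pairs_within`, `haversine_km` and `max_km` are hypothetical, and the result is a dictionary of close pairs rather than a dense matrix, so distant pairs are neither computed in full nor stored.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in km


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def pairs_within(points, max_km=10.0):
    """Return {(i, j): distance_km} for all i < j with distance <= max_km."""
    # Points within max_km of each other differ in latitude by at most this.
    max_dlat = math.degrees(max_km / EARTH_RADIUS_KM)
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    result = {}
    for pos, i in enumerate(order):
        for k in range(pos + 1, len(order)):
            j = order[k]
            if points[j][0] - points[i][0] > max_dlat:
                break  # every later point lies even farther north
            d = haversine_km(points[i], points[j])
            if d <= max_km:
                result[(min(i, j), max(i, j))] = d
    return result


places = [(48.0, 11.0), (48.05, 11.05), (49.0, 11.0), (48.0, 11.1)]
close_pairs = pairs_within(places, max_km=10.0)
print(sorted(close_pairs))
```

The latitude bound never discards a pair that is actually within range, but it only prunes in one direction. For data that is spread mostly east to west, a spatial index would prune more.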

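【Discussion】:

Thank you very much! Since I need the result in kilometers, if it is 60 times faster I will have to convert the euclidean distance to kilometers. Thank you very much.
Sorry if I was not clear enough: the 60x figure is the difference between the original and the vectorized implementation. But even just switching the metric to "euclidean" in the original implementation (pdist with its for loop) is faster (it gave me roughly a 150x speedup). I would need more measurements, but switching to euclidean will definitely help you.
A 150x difference is huge for me. This really helped a lot! Thank you very much!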
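
The conversion from a euclidean distance to kilometers raised in the comments above can be sketched with the equirectangular approximation: scale the longitude difference by the cosine of the mean latitude, take the planar distance in radians, and multiply by the Earth radius. This is an illustration under assumptions, not the method used by either commenter. The function names and the radius constant are introduced here, and the approximation is only suitable for short distances such as the 10 km range in the question, away from the poles and the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in km


def equirectangular_km(lat1, lon1, lat2, lon2):
    """Approximate distance in km between two nearby points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)  # east-west component
    y = lat2 - lat1                                  # north-south component
    return EARTH_RADIUS_KM * math.hypot(x, y)


def haversine_km(lat1, lon1, lat2, lon2):
    """Reference great-circle distance in km, for comparison."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


approx = equirectangular_km(48.0, 11.0, 48.05, 11.05)
exact = haversine_km(48.0, 11.0, 48.05, 11.05)
print(approx, exact)
```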