Speed up getting the distance between two latitude/longitude points: answers

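【Question title】: Speed up getting distance between two lat and lon
【Posted】: 2019-12-02 08:25:45
【Question description】:

I have two DataFrames that contain Lat and Lon columns. For each (Lat, Lon) pair in one DataFrame, I want to compute the distance to ALL (Lat, Lon) pairs in the other DataFrame and take the minimum. I am using the geopy package. The code is as follows: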

from geopy import distance
import numpy as np

distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)

    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []

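The problem is that df1 has 168K rows and df2 has 1200 rows. How can I make this faster?

【Question discussion】:

Tags: python-3.x pandas gis geopy

【Solution 1】:

This should run faster if you use itertools instead of explicit for loops. The inline comments should help you follow what happens at each step.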

import numpy as np
import pandas as pd
import itertools
from geopy import distance


#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})


#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))


#The product function in itertools takes the cross product of the 2 iterables
#You should get items of the form ((lat, lon), (lat, lon)) where the 1st is target and the 2nd is point. Feel free to change the order if needed
product = list(itertools.product(target, point))

#starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)

geo_dist = [42.430772028845716,
 44.29982320107605,
 25.88823239877388,
 23.877570442142783,
 29.9351451072828,
 ...]

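Finally, if you are working with very large datasets, I suggest using the multiprocessing library to spread the starmap work across cores and compute the distances asynchronously. Python's multiprocessing library supports starmap (Pool.starmap and Pool.starmap_async).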
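A minimal sketch of the Pool.starmap idea follows. To keep it runnable without geopy, it uses a hypothetical stdlib worker, haversine_miles, as a stand-in for the geopy call (a spherical approximation, not the geodesic default); the function names, the Earth-radius constant, and the chunksize value are all assumptions for illustration.

```python
import math
from itertools import product
from multiprocessing import Pool

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for this sketch


def haversine_miles(target, point):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, target)
    lat2, lon2 = map(math.radians, point)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def all_pair_distances(targets, points, processes=None, chunksize=10_000):
    """Distance for every (target, point) pair, computed across worker processes."""
    with Pool(processes) as pool:
        # starmap unpacks each (target, point) tuple into the worker's two arguments
        return pool.starmap(haversine_miles, product(targets, points),
                            chunksize=chunksize)
```

Call all_pair_distances from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Note that Pool.starmap collects every result in one list, so with 168K x 1200 pairs it is probably better to feed it one slice of df1 at a time.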

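【Discussion】:

【Solution 2】:

If you need to check all pairs by brute force, I think the following is about the best you can do.
Looping directly over the columns is usually slightly faster than iterrows, and replacing the inner loop with a vectorized approach also saves time.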

from geopy import distance

count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    #    print(count) #printing is also time expensive
    #distance from this target to every row of df2
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestdist = df2['dist'].min() #if you want the minimum distance
    closestpoint = df2['dist'].idxmin() #if you want the position (index) of the minimum.
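Since apply with axis=1 still calls a Python function once per row, a fully array-based variant may help further. The sketch below is one possible approach, not the answer author's code: it swaps geopy for a NumPy haversine formula (a spherical approximation) and processes df1 in chunks so the distance matrix stays small. The function name, the radius constant, and the chunk size are assumptions.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for this sketch


def nearest_by_haversine(lat1, lon1, lat2, lon2, chunk=2000):
    """For each point in set 1, return (index, miles) of the closest point in set 2.

    Inputs are 1-D sequences in degrees, e.g. df1["LAT"].to_numpy().
    """
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=float)) for a in (lat1, lon1, lat2, lon2)
    )
    idx = np.empty(lat1.size, dtype=np.intp)
    dist = np.empty(lat1.size)
    for start in range(0, lat1.size, chunk):
        sl = slice(start, start + chunk)
        # Broadcast a (chunk, 1) column against a (1, n2) row -> (chunk, n2) matrix
        dlat = lat2[None, :] - lat1[sl, None]
        dlon = lon2[None, :] - lon1[sl, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat1[sl, None]) * np.cos(lat2[None, :]) * np.sin(dlon / 2) ** 2)
        d = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        idx[sl] = d.argmin(axis=1)                    # position of the nearest point
        dist[sl] = d[np.arange(d.shape[0]), idx[sl]]  # its distance in miles
    return idx, dist
```

With the DataFrames from the question, a call could look like `idx, miles = nearest_by_haversine(df1["LAT"], df1["LON"], df2["LAT"], df2["LON"])`. The returned indices are positions into df2, not index labels.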

【Discussion】:

【Solution 3】:

geopy.distance.distance uses the geodesic algorithm by default, which is slower but more accurate. If you can trade accuracy for speed, you can use great_circle, which is roughly 20 times faster:

In [4]: %%timeit
   ...: distance.distance(newport_ri, cleveland_oh).miles
   ...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit
   ...: distance.great_circle(newport_ri, cleveland_oh).miles
   ...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can also use multiprocessing to parallelize the computation:

from multiprocessing import Pool
from geopy import distance
import numpy as np


def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles


with Pool() as pool:
    for id1, row1 in df1.iterrows():
        target = (row1["LAT"], row1["LON"])
        distanceMiles = pool.map(
            compute,
            (
                (target, (row2["LAT"], row2["LON"]))
                for id2, row2 in df2.iterrows()
            )
        )
        closestPoint = np.argmin(distanceMiles)

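【Discussion】:

【Solution 4】:

Leaving this here in case someone needs it in the future:

If you only need the minimum distance, you don't have to brute-force all pairs. There are data structures that can solve this in O(n*log(n)) time, which is much faster than the brute-force approach.

For example, you can use a generalized KNearestNeighbors algorithm (with k=1) to do this, provided you take into account that your points lie on a sphere rather than on a plane. See this SO answer for an example implementation using sklearn.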
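As a rough illustration of the nearest-neighbor idea (not the code from the linked answer, which is not reproduced here), the sketch below uses sklearn's BallTree with the haversine metric, which expects [lat, lon] in radians and returns distances in radians on the unit sphere. The function name and the Earth-radius constant are assumptions.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for this sketch


def nearest_with_balltree(lat1, lon1, lat2, lon2):
    """For each point in set 1, return (index, miles) of the closest point in set 2."""
    # The haversine metric works on [latitude, longitude] pairs in radians
    candidates = np.radians(np.column_stack([lat2, lon2]))
    queries = np.radians(np.column_stack([lat1, lon1]))
    tree = BallTree(candidates, metric="haversine")
    # k=1 asks for the single nearest neighbor of every query point
    dist_rad, idx = tree.query(queries, k=1)
    # Scale the unit-sphere distance by the Earth's radius to get miles
    return idx[:, 0], dist_rad[:, 0] * EARTH_RADIUS_MILES
```

Here the tree is built once over the 1200 rows of df2 and then queried with all rows of df1, so no pairwise distance matrix is ever materialized.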

There also appear to be libraries that address this problem, such as sknni and GriSPy.

Here is another question that goes into a bit of the theory.

【Discussion】: