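【Question title】: Speed up getting distance between two lat and lon
【Posted】: 2019-12-02 08:25:45
【Question】:

I have two DataFrames containing Lat and Lon. I want to find the distance from each (Lat, Lon) pair in one DataFrame to ALL (Lat, Lon) pairs in the other DataFrame and get the minimum. The package I am using is geopy. The code is as follows: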

from geopy import distance
import numpy as np

distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)

    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []

The problem is that df1 has 168K rows and df2 has 1200 rows. How can I make it faster?

【Comments】:

    Tags: python-3.x pandas gis geopy


    【Solution 1】:

    This should run faster if you use itertools instead of explicit for loops. The inline comments should help you understand what happens at each step.

    import numpy as np
    import pandas as pd
    import itertools
    from geopy import distance
    
    
    #Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
    df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
    df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})
    
    
    #Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
    target = list(zip(df1['LAT'], df1['LON']))
    point = list(zip(df2['LAT'], df2['LON']))
    
    
    #The product function in itertools computes the cross product of the 2 iterables
    #You should get things of the form ( ( lat, lon), (lat, lon) ) where 1st is target, second is point. Feel free to change the order if needed
    product = list(itertools.product(target, point))
    
    #starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
    geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
    len(geo_dist)
    
    50
    
    geo_dist = [42.430772028845716,
     44.29982320107605,
     25.88823239877388,
     23.877570442142783,
     29.9351451072828,
     ...]
    

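    Finally, if you are working with large datasets, I would suggest using the multiprocessing library to spread the starmap call across cores and compute the distance values asynchronously. Python's multiprocessing library now supports starmap (Pool.starmap).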
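
A minimal sketch of the starmap suggestion above, under stated assumptions: `haversine_miles` is a hypothetical pure-Python great-circle helper standing in for `distance.distance` (so the sketch has no third-party dependency and gives slightly different values than geopy's geodesic), and the function names, the Earth radius constant, and the `chunksize` default are illustrative choices, not part of the answer.

```python
import math
from itertools import product, starmap
from multiprocessing import Pool

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; an approximation


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, h)))


def pairwise_miles(targets, points):
    """Serial version: one distance per (target, point) pair, in product order."""
    return list(starmap(haversine_miles, product(targets, points)))


def pairwise_miles_parallel(targets, points, processes=None, chunksize=1024):
    """Same result as pairwise_miles, but computed by a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # Pool.starmap unpacks each (target, point) tuple into two arguments.
        return pool.starmap(haversine_miles, product(targets, points),
                            chunksize=chunksize)
```

On platforms that start workers by spawning a fresh interpreter, `pairwise_miles_parallel` should be called from under an `if __name__ == "__main__":` guard. Note also that 168K x 1200 is roughly 200 million pairs, so in practice the targets would probably need to be fed through in batches rather than as one product.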

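    【Comments】:

      【Solution 2】:

      If you need to check all pairs by brute force, I think the following approach is about the best you can do.
      Looping directly over the columns is usually somewhat faster than iterrows, and replacing the inner loop with a vectorized approach also saves time.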

      count = 0
      for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
          target = (lat1, lon1)
          count = count + 1
          #    print(count) #printing is also time expensive
          #apply over df2 (the candidate points), not df1
          df2['dist'] = df2.apply(lambda row : distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
          closest_dist = df2['dist'].min() #if you want the minimum distance
          closest_idx = df2['dist'].idxmin() #if you want the position (index) of the minimum.
      

      【Comments】:

        【Solution 3】:

        geopy.distance.distance uses the geodesic algorithm by default, which is slower but more accurate. If you can trade accuracy for speed, you can use great_circle, which is roughly 20 times faster:

        In [4]: %%timeit
           ...: distance.distance(newport_ri, cleveland_oh).miles
           ...:
        236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
        
        In [5]: %%timeit
           ...: distance.great_circle(newport_ri, cleveland_oh).miles
           ...:
        13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
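
If great-circle accuracy is acceptable, the per-pair Python call can be avoided entirely. The following is a sketch, not something from the answer: it assumes the haversine formula with a fixed mean Earth radius (so results differ slightly from geopy), and the function name `nearest_great_circle` and the `chunk_size` default are hypothetical. It uses NumPy broadcasting to compare a chunk of targets against all points at once.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; an approximation


def nearest_great_circle(targets_deg, points_deg, chunk_size=2_000):
    """For each target (lat, lon) in degrees, return the index of the nearest
    point and the great-circle distance to it in miles."""
    targets = np.radians(np.asarray(targets_deg, dtype=float))
    points = np.radians(np.asarray(points_deg, dtype=float))
    p_lat, p_lon = points[:, 0], points[:, 1]
    cos_p_lat = np.cos(p_lat)

    nearest_idx = np.empty(len(targets), dtype=np.intp)
    nearest_miles = np.empty(len(targets), dtype=float)

    # Work in chunks so the (chunk x points) distance matrix stays small.
    for start in range(0, len(targets), chunk_size):
        chunk = targets[start:start + chunk_size]
        t_lat = chunk[:, 0][:, None]  # column vectors broadcast against points
        t_lon = chunk[:, 1][:, None]
        h = (np.sin((p_lat - t_lat) / 2) ** 2
             + np.cos(t_lat) * cos_p_lat * np.sin((p_lon - t_lon) / 2) ** 2)
        dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))
        idx = dist.argmin(axis=1)
        nearest_idx[start:start + len(chunk)] = idx
        nearest_miles[start:start + len(chunk)] = dist[np.arange(len(chunk)), idx]
    return nearest_idx, nearest_miles
```

With the DataFrames from the question, this could be called as `nearest_great_circle(df1[["LAT", "LON"]].to_numpy(), df2[["LAT", "LON"]].to_numpy())`.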
        

        You can also use multiprocessing to parallelize the computation:

        from multiprocessing import Pool
        from geopy import distance
        import numpy as np
        
        
        def compute(points):
            target, point = points
            return distance.great_circle(target, point).miles
        
        
        with Pool() as pool:
            for id1, row1 in df1.iterrows():
                target = (row1["LAT"], row1["LON"])
                distanceMiles = pool.map(
                    compute,
                    (
                        (target, (row2["LAT"], row2["LON"]))
                        for id2, row2 in df2.iterrows()
                    )
                )
                closestPoint = np.argmin(distanceMiles)
        

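        【Comments】:

          【Solution 4】:

          Leaving this here in case someone needs it in the future:

          If you only need the minimum distance, you don't have to brute-force all pairs. There are data structures that can help you solve this in O(n*log(n)) time, which is far faster than the brute-force approach.

          For example, you can use a generalized KNearestNeighbors (k=1) algorithm for this, provided you take into account that your points lie on a sphere rather than on a plane. See this SO answer for an example implementation using sklearn.

          There also seem to be some libraries that address this, such as sknni and GriSPy.

          Here is another question that goes a little into the theory.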
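
One way the nearest-neighbour idea might look with sklearn is sketched below. This is not the linked answer's code: it assumes scikit-learn's `BallTree` with the `haversine` metric (which expects `[lat, lon]` in radians and returns distances in radians), a fixed mean Earth radius, and a hypothetical function name `nearest_with_balltree`.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; an approximation


def nearest_with_balltree(targets_deg, points_deg):
    """For each target (lat, lon) in degrees, return the index of the nearest
    point and the great-circle distance to it in miles."""
    points = np.radians(np.asarray(points_deg, dtype=float))
    targets = np.radians(np.asarray(targets_deg, dtype=float))

    # Build the tree once over the smaller set (df2 in the question).
    tree = BallTree(points, metric="haversine")

    # k=1 returns the single nearest neighbour for every target.
    dist_rad, idx = tree.query(targets, k=1)
    return idx[:, 0], dist_rad[:, 0] * EARTH_RADIUS_MILES
```

The tree is built over the 1200 points and queried with the 168K targets, so each target costs a tree lookup instead of 1200 distance evaluations.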

          【Comments】:
