【问题标题】:Pairwise haversine distance calculation成对haversine距离计算
【发布时间】:2016-04-06 02:28:08
【问题描述】:

我有两个带 lat 和 long 的数组。我想计算数组中每对纬度和经度与其他每对经度和经度之间的距离。 这是我的两个数组。

lat_array

array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132,
        0.33370132,  0.33370132,  0.33371075,  0.33371075,  0.33370132,
        0.33370132,  0.33370132,  0.33356488,  0.33356488,  0.33370132,
        0.33370132,  0.33370132,  0.33401788,  0.33362632,  0.33362632,
        0.33364007,  0.33370132,  0.33401788,  0.33401788,  0.33358399,
        0.33358399,  0.33358399,  0.33370132,  0.33370132,  0.33362632,
        0.33370132,  0.33370132,  0.33370132,  0.33370132,  0.33370132,
        0.33356488,  0.33356456,  0.33391071,  0.33370132,  0.33356488,
        0.33356488,  0.33356456,  0.33356456,  0.33356456,  0.33362632,
        0.33364804,  0.3336314 ,  0.33370132,  0.33370132,  0.33370132,
        0.33364034,  0.33359921,  0.33370132,  0.33360397,  0.33348863,
        0.33370132])
long_array

array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27246931,  1.27246931,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27254305,  1.27254305,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27259085,  1.27250461,  1.27250461,
        1.27251211,  1.2724337 ,  1.27259085,  1.27259085,  1.27252134,
        1.27252134,  1.27252134,  1.2724337 ,  1.2724337 ,  1.27250461,
        1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27254305,  1.27253229,  1.27266808,  1.2724337 ,  1.27254305,
        1.27254305,  1.27253229,  1.27253229,  1.27253229,  1.27250461,
        1.27250534,  1.27250184,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27251339,  1.27223739,  1.2724337 ,  1.2722575 ,  1.27237575,
        1.2724337 ])

转换为弧度后。现在我想要第一对lat和long之间的距离,剩下的lat和long对等等。并想打印对和对应的距离。

这就是我在 python 中所做的。

distance = []
R = 6371.0

for i in range(len(lat_array)):
   for j in (i+1,len(lat_array)):
      dlon = long_array[j]-long_array[i]
      dlat = lat_array[j]-lat_array[i]
      a = sin(dlat / 2)**2 + cos(lat_array[i]) * cos(lat_array[j]) *     
          sin(dlon / 2)**2
      c = 2 * atan2(sqrt(a), sqrt(1 - a))

      distance.append(R * c)

它给了我一个错误IndexError: index 56 is out of bounds for axis 0 with size 56 我在哪里做错了?如果数组很大,如何使计算更快?请帮忙。

【问题讨论】:

    标签: python arrays performance numpy haversine


    【解决方案1】:

    由于这是目前 Google 的“成对半正弦距离”的最高结果,我将加两分钱:如果您可以访问 scikit-learn,这个问题可以很快得到解决。查看sklearn.metrics.pairwise_distances 时,您会注意到不支持“haversine”指标,但它在sklearn.neighbors.DistanceMetric 中实现。

    这意味着您可以执行以下操作:

    from sklearn.neighbors import DistanceMetric
    
    def sklearn_haversine(lat, lon):
        haversine = DistanceMetric.get_metric('haversine')
        latlon = np.hstack((lat[:, np.newaxis], lon[:, np.newaxis]))
        dists = haversine.pairwise(latlon)
        return 6371 * dists
    

    请注意,latlon 的串联只是必要的,因为它们是单独的数组。如果您将它们作为形状(n_samples, 2) 的组合数组传递,您可以直接在它们上调用haversine.pairwise。此外,仅当您需要以公里为单位的距离时,才需要乘以 6371。例如。如果您只想找到最近的一对点,则无需执行此步骤。

    验证:

    In [87]: lat = np.array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])
    
    In [88]: lng = np.array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])
    
    In [89]: sklearn_haversine(lat, lng)
    Out[89]:
    array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
           [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
           [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
           [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
           [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])
    

    性能:

    In [91]: lat = np.random.randn(1000)
    
    In [92]: lng = np.random.randn(1000)
    
    In [93]: %timeit original_app(lat,lng)
    1 loops, best of 3: 1.46 s per loop
    
    In [94]: %timeit vectorized_app1(lat,lng)
    10 loops, best of 3: 86.7 ms per loop
    
    In [95]: %timeit vectorized_app2(lat,lng)
    10 loops, best of 3: 75.7 ms per loop
    
    In [96]: %timeit sklearn_haversine(lat,lng)
    10 loops, best of 3: 76 ms per loop
    

    总之,您可以用更短更简单的代码以vectorized_app2 的速度获得 Divakar 的vectorized_app1 的输出。

    【讨论】:

      【解决方案2】:

      假设latlng 是纬度和经度数组,并且它们的数据以弧度为单位,这是一个基于this other solution 的矢量化解决方案-

      # Elementwise differentiations for lattitudes & longitudes
      dflat = lat[:,None] - lat
      dflng = lng[:,None] - lng
      
      # Finally Calculate haversine using its distance formula
      d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
      hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))
      

      现在,上述方法将为我们提供所有对的输出,而不管它们的顺序如何。因此,即使距离相同,我们也会为两对 :(point1,point2)(point2,point1) 提供两个距离输出。因此,为了节省内存并希望获得更好的性能,您可以使用 np.triu_indices 创建唯一的配对 ID,并修改前面列出的方法,如下所示 -

      # Elementwise differentiations for lattitudes & longitudes, 
      # but not repeat for the same paired elements
      N = lat.size
      idx1,idx2 = np.triu_indices(N,1)
      dflat = lat[idx2] - lat[idx1]
      dflng = lng[idx2] - lng[idx1]
      
      # Finally Calculate haversine using its distance formula
      d = np.sin(dflat/2)**2 + np.cos(lat[idx2])*np.cos(lat[idx1]) * np.sin(dflng/2)**2
      hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))
      

      函数定义-

      def original_app(lat,lng):
          distance = []
          R = 6371.0
          for i in range(len(lat)):
             for j in range(i+1,len(lat)):
                dlon = lng[j]-lng[i]
                dlat = lat[j]-lat[i]
                a = sin(dlat / 2)**2 + cos(lat[i]) * cos(lat[j]) * sin(dlon / 2)**2
                c = 2 * atan2(sqrt(a), sqrt(1 - a))
                distance.append(R * c)
          return distance
      
      def vectorized_app1(lat,lng):                               
          dflat = lat[:,None] - lat
          dflng = lng[:,None] - lng
          d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
          return 2 * 6371 * np.arcsin(np.sqrt(d))
      
      def vectorized_app2(lat,lng):                               
          N = lat.size
          idx1,idx2 = np.triu_indices(N,1)
          dflat = lat[idx2] - lat[idx1]
          dflng = lng[idx2] - lng[idx1]
          d =np.sin(dflat/2)**2+np.cos(lat[idx2])*np.cos(lat[idx1])*np.sin(dflng/2)**2
          return  2 * 6371 * np.arcsin(np.sqrt(d))
      

      验证输出 -

      In [78]: lat
      Out[78]: array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])
      
      In [79]: lng
      Out[79]: array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])
      
      In [80]: original_app(lat,lng)
      Out[80]: 
      [0.2522702110418014,
       0.2522702110418014,
       2.909533226553249,
       1.0542204712876762,
       0.0,
       3.003834632906676,
       0.9897592295963831,
       3.003834632906676,
       0.9897592295963831,
       2.2276138997714474]
      
      In [81]: vectorized_app1(lat,lng)
      Out[81]: 
      array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
             [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
             [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
             [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
             [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])
      
      In [82]: vectorized_app2(lat,lng)
      Out[82]: 
      array([ 0.25227021,  0.25227021,  2.90953323,  1.05422047,  0.        ,
              3.00383463,  0.98975923,  3.00383463,  0.98975923,  2.2276139 ])
      

      运行时测试-

      In [83]: lat = np.random.randn(1000)
      
      In [84]: lng = np.random.randn(1000)
      
      In [85]: %timeit original_app(lat,lng)
      1 loops, best of 3: 2.11 s per loop
      
      In [86]: %timeit vectorized_app1(lat,lng)
      1 loops, best of 3: 263 ms per loop
      
      In [87]: %timeit vectorized_app2(lat,lng)
      1 loops, best of 3: 224 ms per loop
      

      因此,就性能而言,vectorized_app2 可能是要走的路!

      【讨论】:

      • lat[:,None] 是什么意思?
      • @kfx 这意味着我们通过添加一个单一维度(沿该维度的元素数为 1 的维度)将 lat 从一维数组扩展到二维数组,因此当减去 lat ,广播将发生,我们将对lat 中的每个元素与其中的所有其他元素相减,作为一个二维数组。有关更多信息,请参阅广播文档 - docs.scipy.org/doc/numpy/user/basics.broadcasting.html
      【解决方案3】:

      您的代码中有错字。改变

      for j in (i+1,len(lat_array)):
      

      for j in range(i+1,len(lat_array)):
      

      否则,您将迭代一个由两个元素 i+1len(lat_array) 组成的元组。第二个导致错误。

      【讨论】:

      • 哦,我明白了。那真是个愚蠢的错误。但是我如何打印有距离的对呢?
      【解决方案4】:

      scikit-learn 0.21.0(2019-05 发布)中引入的 hasrsine_distances 函数可用于此目的。示例命令:

       % ipython     
      Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
      Type 'copyright', 'credits' or 'license' for more information
      IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
      
      In [1]: 
      import numpy as np
      lat = np.array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])
      lon = np.array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])
      position = np.column_stack((lat, lon))
      position
      Out[1]: 
      array([[0.33356456, 1.27253229],
             [0.33355585, 1.27249141],
             [0.33355585, 1.27249141],
             [0.33401788, 1.27259085],
             [0.33370132, 1.2724337 ]])
      
      In [2]: 
      from sklearn.metrics.pairwise import haversine_distances
      R = 6371.0
      D1 = R * haversine_distances(position)
      D1
      Out[2]: 
      array([[0.        , 0.25227021, 0.25227021, 2.90953323, 1.05422047],
             [0.25227021, 0.        , 0.        , 3.00383463, 0.98975923],
             [0.25227021, 0.        , 0.        , 3.00383463, 0.98975923],
             [2.90953323, 3.00383463, 3.00383463, 0.        , 2.2276139 ],
             [1.05422047, 0.98975923, 0.98975923, 2.2276139 , 0.        ]])
      

      参考:-

      【讨论】:

        猜你喜欢
        • 2017-11-24
        • 1970-01-01
        • 2016-04-02
        • 2016-10-18
        • 1970-01-01
        • 2020-01-01
        • 2013-09-02
        • 1970-01-01
        • 2014-04-29
        相关资源
        最近更新 更多