【问题标题】:Find the nearest point in distance for all the points in the dataset - Python查找数据集中所有点的距离最近的点 - Python
【发布时间】:2017-07-16 09:42:01
【问题描述】:

我有一个数据集如下,

Id     Latitude      longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51         54.54

我想为数据集的每个点找到最近的距离。我在网上找到了以下距离函数,

from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

我正在使用以下功能,

shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

但是这段代码对每个条目循环两次并返回 n^2 次迭代,这反过来又很慢。我的数据集包含近 100 万条记录,每次循环遍历所有元素两次变得非常昂贵。

我想找到更好的方法来找出每一行的最近点。任何人都可以帮助我找到在 python 中解决这个问题的方法吗?

谢谢

【问题讨论】:

    标签: python python-2.7 python-3.x


    【解决方案1】:

    找到最接近给定点的N 点的蛮力方法是O(N)——您必须检查每个点。 相反,如果N 点存储在KD-tree 中,那么找到最近的点平均为O(log(N))。 构建 KD-tree 还需要额外的一次性成本,这需要O(N) 时间。

    如果你需要重复这个过程N次,那么蛮力方法是O(N**2),kd-tree方法是O(N*log(N))。 因此,对于足够大的N,KD-tree 将击败蛮力方法。

    有关最近邻算法(包括 KD-tree)的更多信息,请参阅 here


    下面(在函数using_kdtree 中)是一种使用scipy.spatial.kdtree 计算最近邻居的大圆弧长的方法。

    scipy.spatial.kdtree 使用点之间的欧几里德距离,但有一个 formula 用于将球体上的点之间的欧几里德弦距离转换为大圆弧长(给定球体的半径)。 所以思路是将纬度/经度数据转换成笛卡尔坐标,使用KDTree找到最近的邻居,然后应用great circle distance formula得到想要的结果。


    这里有一些基准。使用N = 100using_kdtreeorig(蛮力)方法快 39 倍。

    In [180]: %timeit using_kdtree(data)
    100 loops, best of 3: 18.6 ms per loop
    
    In [181]: %timeit using_sklearn(data)
    1 loop, best of 3: 214 ms per loop
    
    In [179]: %timeit orig(data)
    1 loop, best of 3: 728 ms per loop
    

    对于N = 10000

    In [5]: %timeit using_kdtree(data)
    1 loop, best of 3: 2.78 s per loop
    
    In [6]: %timeit using_sklearn(data)
    1 loop, best of 3: 1min 15s per loop
    
    In [7]: %timeit orig(data)
    # untested; too slow
    

    由于 using_kdtreeO(N log(N)) 并且 origO(N**2),因此因子 其中using_kdtreeorig 快将增长为N,长度为 data,成长。


    import numpy as np
    import scipy.spatial as spatial
    import pandas as pd
    import sklearn.neighbors as neighbors
    from math import radians, cos, sin, asin, sqrt
    
    R = 6367
    
    def using_kdtree(data):
        "Based on https://*.com/q/43020919/190597"
        def dist_to_arclength(chord_length):
            """
            https://en.wikipedia.org/wiki/Great-circle_distance
            Convert Euclidean chord length to great circle arc length
            """
            central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
            arclength = R*central_angle
            return arclength
    
        phi = np.deg2rad(data['Latitude'])
        theta = np.deg2rad(data['Longitude'])
        data['x'] = R * np.cos(phi) * np.cos(theta)
        data['y'] = R * np.cos(phi) * np.sin(theta)
        data['z'] = R * np.sin(phi)
        tree = spatial.KDTree(data[['x', 'y','z']])
        distance, index = tree.query(data[['x', 'y','z']], k=2)
        return dist_to_arclength(distance[:, 1])
    
    def orig(data):
        def distance(lon1, lat1, lon2, lat2):
            """
            Calculate the great circle distance between two points 
            on the earth (specified in decimal degrees)
            """
            # convert decimal degrees to radians 
            lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
            # haversine formula 
            dlon = lon2 - lon1 
            dlat = lat2 - lat1 
            a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
            c = 2 * asin(sqrt(a)) 
            km = R * c
            return km
    
        shortest_distance = []
        for i in range(len(data)):
            distance1 = []
            for j in range(len(data)):
                if i == j: continue
                distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
                                          data['Longitude'][j], data['Latitude'][j]))
            shortest_distance.append(min(distance1))
        return shortest_distance
    
    
    def using_sklearn(data):
        """
        Based on https://*.com/a/45127250/190597 (Jonas Adler)
        """
        def distance(p1, p2):
            """
            Calculate the great circle distance between two points
            on the earth (specified in decimal degrees)
            """
            lon1, lat1 = p1
            lon2, lat2 = p2
            # convert decimal degrees to radians
            lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
            # haversine formula
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
            c = 2 * np.arcsin(np.sqrt(a))
            km = R * c
            return km
        points = data[['Longitude', 'Latitude']]
        nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
        distances, indices = nbrs.kneighbors(points)
        result = distances[:, 1]
        return result
    
    np.random.seed(2017)
    N = 1000
    data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
                         'Longitude':np.random.uniform(0,360,size=N)})
    
    expected = orig(data)
    for func in [using_kdtree, using_sklearn]:
        result = func(data)
        assert np.allclose(expected, result)
    

    【讨论】:

      【解决方案2】:

      你可以通过调用一个为此实现智能算法的库来做到这一点very efficiently,一个例子是 sklearn,它有一个 NearestNeighbors 方法可以做到这一点。

      为此修改的代码示例:

      from sklearn.neighbors import NearestNeighbors
      import numpy as np
      
      def distance(p1, p2):
          """
          Calculate the great circle distance between two points
          on the earth (specified in decimal degrees)
          """
          lon1, lat1 = p1
          lon2, lat2 = p2
          # convert decimal degrees to radians
          lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
          # haversine formula
          dlon = lon2 - lon1
          dlat = lat2 - lat1
          a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
          c = 2 * np.arcsin(np.sqrt(a))
          km = 6367 * c
          return km
      
      points = [[25.42, 55.47],
                [25.39, 55.47],
                [24.48, 54.38],
                [24.51, 54.54]]
      
      nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
      
      distances, indices = nbrs.kneighbors(points)
      
      result = distances[:, 1]
      

      给了

      >>> result
      array([  1.889697  ,   1.889697  ,  17.88530556,  17.88530556])
      

      【讨论】:

        【解决方案3】:

        您可以使用字典来散列一些计算。您的代码多次计算 A 到 B 的距离(A 和 B 是数据集中的 2 个任意点)。

        要么实现你自己的缓存:

        from math import radians, cos, sin, asin, sqrt
        
        dist_cache = {}
        def distance(lon1, lat1, lon2, lat2):
            """
            Calculate the great circle distance between two points 
            on the earth (specified in decimal degrees)
            """
        
            try:
                return dist_cache[(lon1, lat1, lon2, lat2)]
            except KeyError:
                # convert decimal degrees to radians 
                lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
                # haversine formula 
                dlon = lon2 - lon1 
                dlat = lat2 - lat1 
                a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
                c = 2 * asin(sqrt(a)) 
                km = 6367 * c
                dist_cache[(lon1, lat1, lon2, lat2)] = km
                return km
        

        或者使用lru_cache:

        from math import radians, cos, sin, asin, sqrt
        from functools import lru_cache
        
        @lru_cache
        def distance(lon1, lat1, lon2, lat2):
            """
            Calculate the great circle distance between two points 
            on the earth (specified in decimal degrees)
            """
            # convert decimal degrees to radians 
            lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
            # haversine formula 
            dlon = lon2 - lon1 
            dlat = lat2 - lat1 
            a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
            c = 2 * asin(sqrt(a)) 
            km = 6367 * c
            return km
        

        【讨论】:

        • 这是我第一次听到这样的声音。如何将它与我的数据框一起使用?
        • @haimen 你什么意思?我已经向您展示了如何修改 distance 函数以使用缓存。
        • 对不起.. 我的错。让我试试这个,看看它是如何工作的。
        最近更新 更多