Python nearest neighbour - coordinates

【Question title】: Python nearest neighbour - coordinates
【Posted】: 2016-02-05 15:20:19
【Question】:

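I want to check that I am using scipy's KD tree correctly, because it appears to be slower than a simple brute-force search.

I have three questions about this:

Q1.

If I create the following test data: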

import numpy as np

nplen = 1000000
# Query point, WGS84 lat/long
point = [51.349,-0.19]
# Candidate points as a list of [lat, long] pairs (WGS84)
points = np.column_stack(
        [np.round(np.random.randn(nplen)+51,5),
         np.round(np.random.randn(nplen),5)]).tolist()

And create three functions:

from math import hypot
from scipy import spatial

def kd_test(points,point):
    """ KD Tree"""
    return points[spatial.KDTree(points).query(point)[1]]

def ckd_test(points,point):
    """ C implementation of KD Tree"""
    return points[spatial.cKDTree(points).query(point)[1]]

def closest_math(points,point):
    """ Brute force: planar distance to every point"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))[1:3]

I would expect the cKD tree to be the fastest. However, running this for each function f:

print("Co-ordinate: ", f(points,point))
print("Index: ", points.index(list(f(points,point))))
%timeit f(points,point)

gives these timings, in which the simple brute-force method is faster:

closest_math: 1 loops, best of 3: 3.59 s per loop
ckd_test: 1 loops, best of 3: 13.5 s per loop
kd_test: 1 loops, best of 3: 30.9 s per loop

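Is this because I am somehow using it wrong?

Q2.

I assumed that the data would still need to be projected even to get just the ranking of the closest points (rather than the distances). However, the projected and unprojected points seem to give me the same nearest neighbour: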

from pyproj import Proj, transform

def proj_list(points,
              inproj = Proj(init='epsg:4326'),
              outproj = Proj(init='epsg:27700')):
    """ Projected geo coordinates"""
    # points hold (lat, long); transform() takes x=long, y=lat
    return [list(transform(inproj,outproj,x,y)) for y,x in points]

proj_points = proj_list(points)
proj_point = proj_list([point])[0]

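Is this simply because my points are not spread widely enough to introduce distortion? I re-ran this a few times and still got the same index back from the projected and unprojected lists.

Q3.

Is it generally faster to project the points (as above) and compute the hypotenuse distance than to compute the haversine or Vincenty distance on the (unprojected) lat/long? And which option is more accurate? I ran a small test: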

from math import asin, cos, hypot, radians, sin, sqrt
def haversine(origin,
              destination):
    """
    Find distance between a pair of lat/lng coordinates
    """
    lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371000  # Metres
    return (c * r)

def closest_math_unproj(points,point):
    """ Haversine on unprojected """
    return (min((haversine(point,pt),pt[0],pt[1]) for pt in points))

def closest_math_proj(points,point):
    """ Simple angle since projected"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))

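Results:

So this seems to say that projecting first and then computing distances is faster than not projecting. However, I am not sure which method gives the more accurate results.

Testing against an online Vincenty calculation seems to suggest that projected coordinates are the way to go: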

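【Comments】:

An almost unrelated suggestion: using %timeit -n 10 f(points,point) may be more convenient than using %timeit for x in range(10): f(points,point).
By the way, github.com/storpipfugl/pykdtree may be worth a look. It may not solve the efficiency problem relative to the brute-force approach, but it may be somewhat faster than scipy's default implementation.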

Tags: python scipy spatial kdtree map-projections

【Solution 1】:

Q1.

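The reason for the apparent inefficiency of the k-d trees is simple: you are measuring the construction and the querying of the k-d tree together. That is not how you would or should use a k-d tree: you should build it only once. If you measure only the query, the time taken drops to a few tens of milliseconds (compared with seconds for the brute-force approach).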
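A minimal sketch of that separation, assuming the same kind of random test data as in the question (with a smaller, seeded sample so it runs quickly). The names `tree` and `nearest` are illustrative and not from the original post:

```python
import numpy as np
from scipy import spatial

# Seeded test data: rows are (lat, long), similar in spirit to the question's data.
rng = np.random.default_rng(0)
n = 100_000
points = np.column_stack([rng.normal(51.0, 1.0, n), rng.normal(0.0, 1.0, n)])
point = np.array([51.349, -0.19])

# Build the tree once, outside anything that gets timed repeatedly.
tree = spatial.cKDTree(points)

def nearest(tree, point):
    """Query an already-built tree; returns (distance, index)."""
    dist, idx = tree.query(point)
    return dist, idx

dist, idx = nearest(tree, point)

# Brute-force check using the same planar distance in degrees.
brute_dist = np.hypot(points[:, 0] - point[0], points[:, 1] - point[1]).min()
```

In IPython, timing `spatial.cKDTree(points)` and `tree.query(point)` separately with `%timeit` should show how much of the original measurement was tree construction.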

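Q2.

This will depend on the spatial distribution of the actual data and on the projection used. There may be slight differences depending on how well the k-d tree implementation balances the tree it builds. If you query only a single point, the result is deterministic and unaffected by the distribution of the points anyway.

The sample data you are using has strong central symmetry, and with your map projection (Transverse Mercator) the differences should be negligible.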
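As a hedged illustration of how the ranking can depend on the data, the sketch below uses two hand-picked candidate points (hypothetical, not from the question) near latitude 51. Treating raw lat/long degrees as planar coordinates picks one candidate, while the haversine distance picks the other, because a degree of longitude is shorter than a degree of latitude at that latitude:

```python
from math import asin, cos, hypot, radians, sin, sqrt

def haversine(origin, destination, r=6371000):
    """Great-circle distance in metres between two (lat, long) pairs."""
    lat1, lon1, lat2, lon2 = map(
        radians, [origin[0], origin[1], destination[0], destination[1]])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * asin(sqrt(a))

point = (51.0, 0.0)
# Hand-picked candidates: one offset in longitude, one offset in latitude.
candidates = [(51.0, 0.015), (51.011, 0.0)]

# Nearest by planar distance in raw degrees.
planar = min(candidates, key=lambda p: hypot(p[0] - point[0], p[1] - point[1]))
# Nearest by great-circle distance.
great_circle = min(candidates, key=lambda p: haversine(point, p))

print("planar degrees:", planar)
print("haversine:     ", great_circle)
```

With densely packed random points such a disagreement is rare for any single query, which is consistent with seeing the same index on repeated runs.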

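Q3.

Technically, the answer to your question is trivial: using the Haversine formula for geographic distance measurement is both more accurate and slower. Whether the trade-off between accuracy and speed is justified depends heavily on your use case and on the spatial distribution of your data (obviously, mainly on the spatial extent).

If the spatial extent of your points is small (regional), then using a suitable projection and a simple Euclidean distance measure may be accurate enough for your use case, and faster than using the Haversine formula.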
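If the Haversine route is chosen, much of its cost in the question's code comes from the per-point Python loop. A minimal sketch of a NumPy-vectorised variant follows, assuming points are stored as an (n, 2) array of (lat, long) in degrees and the same spherical radius as in the question. The name `haversine_np` is illustrative:

```python
import numpy as np

def haversine_np(point, points, r=6371000.0):
    """Great-circle distances in metres from one (lat, long) point
    to every row of an (n, 2) array of (lat, long) points."""
    lat1, lon1 = np.radians(point)
    lat2 = np.radians(points[:, 0])
    lon2 = np.radians(points[:, 1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Small illustrative data set (hypothetical values).
points = np.array([[52.0, 0.0], [51.0, 0.015], [51.011, 0.0]])
point = np.array([51.0, 0.0])

dists = haversine_np(point, points)
nearest = points[np.argmin(dists)]
```

This is still a full scan over all points for each query, so it addresses the cost of the distance formula rather than the cost of searching.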

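【Discussion】:

Thanks Martin - that answers everything. I just wanted to check that you are saying the Haversine formula would be more accurate (and so, by extension, the Vincenty formula). Does that mean that if accuracy is very important, a vectorised numpy Vincenty formula is the way to go?
Sorry, what I meant was: if I have (for example) 10 million coordinates in the UK and my main aim is to minimise the distance error (+-1 metre is fine), should I use scipy.pdist with a vectorised Vincenty formula rather than projecting the coordinates and then running a vectorised Euclidean distance?
Ah, sorry. I misread the last question and missed that you were asking about the haversine or Vincenty formula. You can disregard my last answer. The last question is probably better suited to gis.stackexchange.com than to SO.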