Most efficient way to find polygon-coordinate intersections in a large GeoJSON object

【Question title】: Most efficient way to find polygon-coordinate intersections in large GeoJSON object
【Posted】: 2021-12-14 21:50:16
【Question】:

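I am working on a project that requires coordinate mapping: determining whether a coordinate point lies inside any of a set of polygons. The number of mappings is very large, roughly 10 million coordinates against 100+ million polygons.

Before going further, I have already looked at the questions here and here. This question is not a duplicate, because it involves dynamic points and static polygons.

I scoped the problem down by mapping a single coordinate against a subset of 2 million polygons. This is the code I used: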

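import json

from shapely.geometry import shape, Point

# Load the whole FeatureCollection into memory
with open('path/to/file.geojson', 'r') as f:
    data = json.load(f)

# shapely Points are (x, y), i.e. (longitude, latitude), matching GeoJSON order
point = Point(-71.127411, 42.3847)
for feature in data['features']:
    # Build a shapely geometry from the GeoJSON geometry dict
    polygon = shape(feature['geometry'])
    if polygon.contains(point):
        print(polygon)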

Iterating over the 2 million polygons, which are building footprints in this case, takes about 30 seconds (far too long).

I also tried mplPath, as follows:

import matplotlib.path as mplPath
import numpy as np
from tqdm import tqdm

# Exterior ring of each building footprint as an (N, 2) array of (lon, lat)
building_arrays = [np.array(feature['geometry']['coordinates'][0])
                   for feature in tqdm(data['features'])]
bbPath_list = [mplPath.Path(building)
               for building in tqdm(building_arrays)]

for b in tqdm(bbPath_list):
    if b.contains_point((-71.1273842, 42.3847423)):
        print(b)

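This takes about 6 seconds. That is an improvement, but still rather slow given the number of mappings I need.

Is there a faster way to do this kind of mapping? I would rather not use PySpark and distributed computing, since I consider that the nuclear option, but I am willing to if necessary. Is it possible to vectorize the computation instead of iterating over the polygons? I will post an update showing whether numba gives any improvement.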

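【Comments】:

Definitely try Paul's answer. Numba will only help if you can use nopython mode, which means no shapely or matplotlib, so... I wouldn't recommend it. If geopandas.sjoin isn't fast enough, I'd suggest Google BigQuery or another spatially enabled columnar query framework; they are very fast and cheap for this type of thing, and I've never been able to get a distributed engine like dask anywhere near the speeds you see there for spatial queries like this.

Tags: python python-3.x computational-geometry geopandas shapely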

【Solution 1】:

I would use a spatial join.

Given the fake data below, I would join it using the "within" predicate:

from shapely.geometry import Point, Polygon
import geopandas

polys = geopandas.GeoDataFrame({
    "name": ["foo", "bar"],
    "geometry": [
        Polygon([(5, 5), (5, 13), (13, 13), (13, 5)]),
        Polygon([(10, 10), (10, 15), (15, 15), (15, 10)]),
    ]
})

pnts = geopandas.GeoDataFrame({
    "pnt": ["A", "B", "C"],
    "geometry": [
        Point(3, 3), Point(8, 8), Point(11, 11)
    ]
})

# `predicate` replaced the older `op` keyword (deprecated in geopandas 0.10)
result = geopandas.sjoin(pnts, polys, how='left', predicate='within')

I get:

pnt                  geometry  index_right name
  A   POINT (3.00000 3.00000)          NaN  NaN
  B   POINT (8.00000 8.00000)          0.0  foo
  C POINT (11.00000 11.00000)          0.0  foo
  C POINT (11.00000 11.00000)          1.0  bar
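
To connect this back to the question, here is a minimal sketch of the same join starting from a GeoJSON-style dict rather than hand-built polygons. The `data` dict, the `lons`/`lats` lists, and the `name` property are illustrative stand-ins, not the asker's real data, and the CRS is assumed to be EPSG:4326 (the GeoJSON default). For a file on disk, `geopandas.read_file` should give the same kind of GeoDataFrame directly.

```python
import geopandas

# Hypothetical stand-in for the parsed GeoJSON FeatureCollection
data = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "foo"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[5, 5], [5, 13], [13, 13], [13, 5], [5, 5]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "bar"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[10, 10], [10, 15], [15, 15], [15, 10], [10, 10]]],
            },
        },
    ],
}

# Feature properties become columns; geometries go into the "geometry" column
polys = geopandas.GeoDataFrame.from_features(data["features"], crs="EPSG:4326")

# Build all points at once from coordinate arrays (x = longitude, y = latitude)
lons = [3, 8, 11]
lats = [3, 8, 11]
pnts = geopandas.GeoDataFrame(
    {"pnt": ["A", "B", "C"]},
    geometry=geopandas.points_from_xy(lons, lats),
    crs="EPSG:4326",
)

# how="inner" keeps only the points that fall inside at least one polygon
matches = geopandas.sjoin(pnts, polys, how="inner", predicate="within")
pairs = sorted(zip(matches["pnt"], matches["name"]))
print(pairs)
```

With `how="inner"`, point A drops out because it matches no polygon, while C appears twice because it lies in both polygons.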

【Discussion】:

Works like a charm, thanks! Mapped 1 million points to 2 million polygons in 2 seconds
@mmz That's a really good benchmark. Thanks for following up
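
A large part of the speedup reported above likely comes from the spatial index that a spatial join uses: each point is tested only against polygons whose bounding box is nearby, not against every polygon. The sketch below illustrates that idea in pure Python under simplifying assumptions. It uses a uniform grid as a stand-in for the R-tree-style index geopandas relies on, handles exterior rings only (no holes), and all function names (`point_in_ring`, `build_grid_index`, `join_points`) and the `cell` size are hypothetical.

```python
from collections import defaultdict
from math import floor


def point_in_ring(x, y, ring):
    """Ray-casting test against one ring; boundary points get no special handling."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_grid_index(polygons, cell):
    """Map each grid cell to the ids of polygons whose bounding box overlaps it."""
    grid = defaultdict(list)
    for pid, ring in polygons.items():
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        for cx in range(floor(min(xs) / cell), floor(max(xs) / cell) + 1):
            for cy in range(floor(min(ys) / cell), floor(max(ys) / cell) + 1):
                grid[(cx, cy)].append(pid)
    return grid


def join_points(points, polygons, cell=4.0):
    """Return (point name, polygon id) pairs, testing only same-cell candidates."""
    grid = build_grid_index(polygons, cell)
    result = []
    for name, (x, y) in points.items():
        candidates = grid.get((floor(x / cell), floor(y / cell)), [])
        for pid in candidates:
            if point_in_ring(x, y, polygons[pid]):
                result.append((name, pid))
    return result


# Same fake data as in the answer
polygons = {
    "foo": [(5, 5), (5, 13), (13, 13), (13, 5)],
    "bar": [(10, 10), (10, 15), (15, 15), (15, 10)],
}
points = {"A": (3, 3), "B": (8, 8), "C": (11, 11)}

pairs = sorted(join_points(points, polygons))
print(pairs)
```

The index is built once over the static polygons and then reused for every point, which matches the question's setup of dynamic points and static polygons.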