Calculate the distance between two points of two datasets (nearest neighbor): answers

【Question title】: Calculate the distance between two points of two datasets (nearest neighbor)
【Posted】: 2019-04-09 04:53:01
【Question】:

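I want to calculate the distance between points in two different datasets. I don't want the distance between all pairs of points, only the distance from each point to its nearest point in dataset B.
Some examples: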

Dataset A - persons:
http://pastebin.com/HbaeqACi

Dataset B - water features:
http://pastebin.com/UdDvNtHs

Dataset C - cities:
http://pastebin.com/nATnkMRk

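So... I want to calculate, for each person, the distance to the nearest water feature.
I have already tried the rgeos package and, after struggling with some projection errors, got it to work. But it calculates (at least I assume it does) the distances from every point to every other point, whereas, as already said, I am only interested in the distance to the nearest water feature.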

# load csv files
persons = read.csv("persons.csv", header = TRUE)
water = read.csv("water.csv", header = TRUE)
# convert the data frames to SpatialPointsDataFrame objects and assign a projection
library(sp)
library(rgeos)
coordinates(persons) <- c("POINT_X", "POINT_Y")
proj4string(persons) <- CRS("+proj=utm +datum=WGS84")
coordinates(water) <- c("POINT_X", "POINT_Y")
proj4string(water) <- CRS("+proj=utm +datum=WGS84")

# use the rgeos package to calculate the distance
distance <- gDistance(persons, water, byid=TRUE)
# works, but calculates a huge number of distances

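Is there a parameter I missed? Or do I need to use another package or function? I also looked at spatstat, which can calculate the distance to the nearest neighbor, but not between two different datasets: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/spatstat/html/nndist.html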

Edit:
The complete R script, including plotting the datasets:

library(RgoogleMaps)
library(ggplot2)
library(ggmap)
library(sp)
library(fossil) # earth.dist()
library(plyr)   # ldply(), used below

#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("city.csv", header =TRUE)

# plot data
persons_ggplot2 <- persons
city_ggplot2 <- city
water_ggplot2 <- water
gc <- geocode('new york, usa')
center <- as.numeric(gc)  
G <- ggmap(get_googlemap(center = center, color = 'bw', scale = 1, zoom = 11, maptype = "terrain", frame=T), extent="device")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons, shape = 8, color="red", size=2.5) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=water_ggplot2, color="blue", size=1)
plot(G1)

#### calculate distance
# Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)

# Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
   tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
   WaterFeatures <- rbind(tmp,water[,2:3])
   tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
   tmp1 <- which.min(tmp1)
   tmp1 <- water[tmp1,1]
   tmp1 <- data.frame(tmp1, WaterFeature=tmp)
   return(tmp1)
}

# apply to each id and then merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))

【Question comments】:

Tags: r distance spatial

【Solution 1】:

How about writing a function that looks for the nearest water feature for each person?

#requires function earth.dist from "fossil" package
require(fossil)
#requires function ldply from "plyr" package
require(plyr)

#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)

#Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)


#Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
   tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
   WaterFeatures <- rbind(tmp,water[,2:3])
   tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
   tmp1 <- min(tmp1)
   tmp1 <- data.frame(tmp1, WaterFeature=tmp)
   return(tmp1)
 }

#apply to each id and then merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))

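Note: I added the stringsAsFactors argument to the original read.csv calls; it makes the final merge easier.

Note: the tmp1 column records the distance to the nearest water feature. earth.dist from the fossil package reports distances in kilometers, so the value is in KILOMETERS, not meters.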
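
The per-person search above builds a distance object with earth.dist for every person. The same result can be obtained with a vectorized great-circle formula and no extra packages. What follows is only a minimal sketch under stated assumptions, not the answerer's method: the names haversine_km and nearest_feature are hypothetical, the sample coordinates are made up, POINT_X is assumed to hold longitude and POINT_Y latitude (as in the question), and a mean Earth radius of 6371 km is assumed.

```r
# Great-circle (haversine) distance in kilometers between one point and a
# vector of points. All coordinates are in decimal degrees.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# For every row of `from`, return the row index of the closest row of `to`
# and the distance to it. Both data frames need POINT_X and POINT_Y columns.
nearest_feature <- function(from, to) {
  res <- lapply(seq_len(nrow(from)), function(i) {
    d <- haversine_km(from$POINT_X[i], from$POINT_Y[i],
                      to$POINT_X, to$POINT_Y)
    j <- which.min(d)
    data.frame(nearest_idx = j, dist_km = d[j])
  })
  do.call(rbind, res)
}

# Made-up sample data (hypothetical coordinates near New York)
persons <- data.frame(POINT_X = c(-74.00, -73.90), POINT_Y = c(40.70, 40.80))
water   <- data.frame(POINT_X = c(-74.01, -73.89, -73.50),
                      POINT_Y = c(40.71, 40.80, 41.00))

nn <- nearest_feature(persons, water)
persons <- cbind(persons, nn)
print(persons)
```

Only one vector of distances per person is held in memory at a time, so the full persons-by-water matrix is never built.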

【Comments】:

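Thanks for the hint. Unfortunately, after running persons$nearest <- sapply(persons$Id, NearestW) I get the following error: Error in `$<-.data.frame`(`*tmp*`, "nearest", value = list()) : replacement has 0 rows, data has 164. I then saved the result to a new object with persons_nearest <- sapply(persons$Id, NearestW), but the output is an empty list.
What output do you get when you run NearestW("001")?
Running NearestW("001") gives me the following output: > [1] 0
I'm not sure what the problem is. I loaded the data from the links you provided and got it working on my machine, but it may have been imported differently from what you got. Just to check... after the read.csv calls, is persons 5 columns wide and water 3 columns wide?
Argh... I was even too dumb to paste your code into my R script correctly (I missed the line that generates the person IDs). Everything works now :) By the way: you said the calculation could be faster if the nearest water feature is computed once for each unique set of coordinates and the results are then merged. What would that look like? Because... I have transferred your approach to my real dataset (about 2500 persons and 40 cities, so these 2500 persons live in some 40 different locations) and the calculation takes 20 minutes on an i5 @ 4*4Ghz oO.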
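
One possible reading of the "unique coordinates, then merge" idea raised in the comment above: run the expensive search once per distinct location and copy the result back to every person at that location. This is a minimal sketch under stated assumptions. The column names POINT_X and POINT_Y come from the question; nearest_dist is a hypothetical stand-in that uses plain Euclidean distance so that the example is self-contained, and the data are made up.

```r
# Hypothetical stand-in for the expensive search: Euclidean distance from one
# location to its closest feature.
nearest_dist <- function(x, y, features) {
  min(sqrt((features$POINT_X - x)^2 + (features$POINT_Y - y)^2))
}

# Made-up data: four persons living at only two distinct locations
persons <- data.frame(
  name    = c("a", "b", "c", "d"),
  POINT_X = c(0, 0, 10, 10),
  POINT_Y = c(0, 0, 5, 5)
)
water <- data.frame(POINT_X = c(3, 10), POINT_Y = c(4, 7))

# 1. Keep each distinct location once (2 rows instead of 4 here)
locations <- unique(persons[, c("POINT_X", "POINT_Y")])

# 2. Run the search once per distinct location
locations$nearest <- mapply(nearest_dist,
                            locations$POINT_X, locations$POINT_Y,
                            MoreArgs = list(features = water))

# 3. Copy the result back to every person at that location
persons <- merge(persons, locations, by = c("POINT_X", "POINT_Y"))
print(persons)
```

With roughly 2500 persons at about 40 locations, step 2 would run about 40 searches instead of 2500.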

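【Solution 2】:

Maybe I'm too late, but you can use spatstat to calculate the distance between two different datasets. The command is nncross. The arguments you have to pass are two objects of class ppp, which you can create with the as.ppp() function.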
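
A minimal sketch of the nncross approach, assuming the spatstat package is installed and that the coordinates are already planar. The data and the column names x and y are made up, and the observation window is assumed to be the bounding box of both datasets.

```r
library(spatstat)  # provides owin(), as.ppp() and nncross()

# Made-up planar coordinates (for example, meters in a projected system)
persons <- data.frame(x = c(0, 10), y = c(0, 5))
water   <- data.frame(x = c(3, 10, 50), y = c(4, 7, 50))

# One observation window that covers both datasets
win <- owin(xrange = range(c(persons$x, water$x)),
            yrange = range(c(persons$y, water$y)))

# Convert the two-column data frames to point patterns of class ppp
persons_ppp <- as.ppp(persons, W = win)
water_ppp   <- as.ppp(water, W = win)

# For each person: distance to, and index of, the nearest water point
nn <- nncross(persons_ppp, water_ppp)
persons$dist_to_water <- nn$dist
persons$nearest_water <- nn$which
print(persons)
```

By default nncross returns a data frame with the columns dist and which, one row per point of the first pattern.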

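【Comments】:

It may be too late for the reason I originally asked the question, but it will surely help in the future! Thanks! :)
However, if you use spatstat, it is important to project to planar coordinates first. It does not recognize longitude/latitude data automatically. That said, the C code behind nncross is very efficient, so you may see a noticeable speed-up on large datasets.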
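
A minimal sketch of the projection step mentioned above, assuming the sf package is installed. The thread does not say which projection to use; UTM zone 18N (EPSG:32618) is an assumption that fits the New York area plotted in the question, and the sample coordinates are made up.

```r
library(sf)  # provides st_as_sf(), st_transform() and st_coordinates()

# Made-up longitude/latitude data with the question's column names
persons <- data.frame(POINT_X = c(-74.00, -73.90), POINT_Y = c(40.70, 40.80))

# Declare the input as WGS84 longitude/latitude (EPSG:4326)
persons_ll <- st_as_sf(persons, coords = c("POINT_X", "POINT_Y"), crs = 4326)

# Reproject to UTM zone 18N (EPSG:32618); this zone is an assumption
persons_utm <- st_transform(persons_ll, 32618)

# Extract planar coordinates in meters as a matrix with columns X and Y
xy <- st_coordinates(persons_utm)
persons$x_m <- xy[, "X"]
persons$y_m <- xy[, "Y"]
print(persons)
```

The resulting x_m and y_m columns are planar and could then be turned into ppp objects for nncross, which would report distances in meters.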