Calculating geographic distance between two rows in data.table: answers

【Question title】: Calculating geographic distance between two rows in data.table
【Posted】: 2018-10-05 01:48:21
【Question description】:

My question is essentially the same as "calculating distance between two row in a data.table", but I am looking for an answer that uses data.table syntax rather than a for loop.

I have a data.table like this:

Lat      Lon      Time                   Bus
52.21808 20.96675 2018-04-20 21:27:26    3
52.25882 20.89850 2018-04-20 21:27:23    8
52.24347 21.08460 2018-04-20 21:27:27    1
52.21935 20.97186 2018-04-20 21:28:31    3
52.25808 20.89790 2018-04-20 21:28:32    8
52.24541 21.08522 2018-04-20 21:28:36    1

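I want to calculate the distance between each pair of consecutive points, grouped by Bus, using for example distGeo from the geosphere package. So something like: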

d[,distance:=distGeo(c(Lon, Lat), ???????),by=Bus]

EDIT: I get somewhat useful results using

d[,distance:=distGeo(cbind(Lon, Lat)),by=Bus]

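but it is not quite right: there is a warning, because each group comes up one item short and a value has to be recycled. Is there a way to get NA in the first or last row of each bus?

EDIT 2: It looks like I have it.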

d[,distance:=c(distGeo(cbind(Lon, Lat)),NA) ,by=Bus]
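
For anyone who would rather have the NA in the first row of each bus (so that each row holds the distance from the previous point), a minimal sketch follows. It assumes geosphere is installed and that the rows are already in time order within each bus. The guard for single-row groups is my own addition, since I have not checked what distGeo returns for a one-row matrix.

```r
library(data.table)
library(geosphere)

# Sample data from the question (Time column left out for brevity).
d <- data.table(
  Lat = c(52.21808, 52.25882, 52.24347, 52.21935, 52.25808, 52.24541),
  Lon = c(20.96675, 20.89850, 21.08460, 20.97186, 20.89790, 21.08522),
  Bus = c(3L, 8L, 1L, 3L, 8L, 1L)
)

# distGeo() on a single n-row matrix returns the n - 1 distances between
# consecutive rows, so prepending NA gives exactly one value per row.
d[, distance := if (.N < 2L) NA_real_ else c(NA_real_, distGeo(cbind(Lon, Lat))),
  by = Bus]

d
```

Moving the NA back to the end, as in EDIT 2, gives the distance to the next point instead.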

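【Question discussion】:

If each bus has exactly two points, distGeo(c(Lon[1], Lat[1]), c(Lon[2], Lat[2])), I guess. If there can be more than two points, maybe have a look at ?shift. I am not familiar with distGeo's syntax, and the example above is not easy to copy and paste into R to reproduce.
I believe distGeo does take a matrix as an argument. Instead of c(Lon, Lat), maybe you should look into cbind(Lon, Lat). In that case I don't think you need the second argument?
@Onyambu That seems to work! The only thing is that I get a warning, because one row has no defined answer: "Supplied 34 items to be assigned to group 1 of size 35 in column 'distance' (recycled leaving remainder of 1 items)." and so on.
Which one seems to work? c or cbind?
d[,distance:=distGeo(cbind(Lon, Lat)),by=Bus] works. For a given row I get the distance between that row and the next one. For the last row the first distance is recycled, which is misleading; I would rather get NA there, or ideally NA in the first row.

Tags: r data.table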

【Solution 1】:

Here is a solution using the gmt package:

require(data.table)
require(gmt)

set.seed(123)
some_latlon <- data.table(id = sample(x = 1:2, size = 10, replace = TRUE),
                          xfrom = runif(n = 10, min = 3, max = 6),
                          yfrom = runif(n = 10, min = 52, max = 54))

setkey(some_latlon, id)
some_latlon[, xto := c(xfrom[-1], NA), by = id]
some_latlon[, yto := c(yfrom[-1], NA), by = id]

some_latlon[, dist := geodist(Nfrom = yfrom, Efrom = xfrom,
                              Nto = yto, Eto = xto, units = "km"), by = id]

Of course, you can easily drop the xto and yto columns afterwards. HTH.
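
Dropping the helper columns can be sketched as below. The toy table only stands in for some_latlon after the distance has been computed; its values, including the dist column, are placeholders made up for illustration.

```r
library(data.table)

# Placeholder table with the same column layout as some_latlon above.
some_latlon <- data.table(id    = c(1L, 1L, 2L),
                          xfrom = c(3.1, 3.2, 4.0),
                          yfrom = c(52.1, 52.2, 53.0),
                          xto   = c(3.2, NA, NA),
                          yto   = c(52.2, NA, NA),
                          dist  = c(1, NA, NA))  # placeholder, not a real distance

# Assigning NULL with := removes the columns by reference (no copy is made).
some_latlon[, c("xto", "yto") := NULL]

names(some_latlon)
# [1] "id"    "xfrom" "yfrom" "dist"
```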

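【Discussion】:

I think this is similar to the previous answer, in that it creates new columns.

【Solution 2】:

Create two new columns by shifting the Lat/Lon values up one row within each bus: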

setorder(dt, Bus)

dt[, `:=`(Lat_to = shift(Lat, type = "lead"),
          Lon_to = shift(Lon, type = "lead")),
     by = Bus]

Use this function, which I wrote for this answer (it is a more efficient, data.table-style haversine calculation):

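# Haversine (great-circle) distance between two sets of points.
# Inputs are vectors in decimal degrees; r is the sphere radius in metres,
# so the result is in metres.
dtHaversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  radians <- pi/180                  # degrees-to-radians factor
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  # a is the squared half-chord length between the two points
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  # central angle (2 * atan2(...)) times the radius
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}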

Apply it:

dt[, dist := dtHaversine(Lat, Lon, Lat_to, Lon_to)]

dt
#         Lat      Lon       Date     Time Bus   Lat_to   Lon_to      dist
# 1: 52.24347 21.08460 2018-04-20 21:27:27   1 52.24541 21.08522 220.05566
# 2: 52.24541 21.08522 2018-04-20 21:28:36   1       NA       NA        NA
# 3: 52.21808 20.96675 2018-04-20 21:27:26   3 52.21935 20.97186 376.08498
# 4: 52.21935 20.97186 2018-04-20 21:28:31   3       NA       NA        NA
# 5: 52.25882 20.89850 2018-04-20 21:27:23   8 52.25808 20.89790  91.96366
# 6: 52.25808 20.89790 2018-04-20 21:28:32   8       NA       NA        NA

Data

library(data.table)

dt <- fread(
'Lat      Lon      Date         Time          Bus
52.21808 20.96675 2018-04-20 21:27:26    3
52.25882 20.89850 2018-04-20 21:27:23    8
52.24347 21.08460 2018-04-20 21:27:27    1
52.21935 20.97186 2018-04-20 21:28:31    3
52.25808 20.89790 2018-04-20 21:28:32    8
52.24541 21.08522 2018-04-20 21:28:36    1')

Example on 1 million rows

set.seed(123)
dt <- data.table(Lat = sample(-90:90, 1e6, replace = TRUE),
                 Lon = sample(-90:90, 1e6, replace = TRUE),
                 Bus = rep(1:5e5, 2))


setorder(dt, Bus)
system.time({
    dt[, `:=`(Lat_to = shift(Lat, type = "lead"),
              Lon_to = shift(Lon, type = "lead")),
         by = Bus]
    dt[, dist := dtHaversine(Lat, Lon, Lat_to, Lon_to)] 
})
#  user  system elapsed 
# 7.985   0.033   8.020
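
If the two helper columns are not wanted in the table, one possible variant is to call shift() directly inside the distance call, so that nothing extra is stored. This is a minimal sketch on the six-row sample data, reusing dtHaversine as defined above and assuming the rows are in time order within each bus.

```r
library(data.table)

# Same haversine helper as defined earlier in this answer.
dtHaversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137) {
  radians <- pi / 180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- lat_to - lat_from
  dLon <- lon_to - lon_from
  a <- sin(dLat / 2)^2 + cos(lat_from) * cos(lat_to) * sin(dLon / 2)^2
  2 * atan2(sqrt(a), sqrt(1 - a)) * r
}

dt <- data.table(
  Lat = c(52.21808, 52.25882, 52.24347, 52.21935, 52.25808, 52.24541),
  Lon = c(20.96675, 20.89850, 21.08460, 20.97186, 20.89790, 21.08522),
  Bus = c(3L, 8L, 1L, 3L, 8L, 1L)
)
setorder(dt, Bus)

# shift() is evaluated within each Bus group, so the last row of every bus
# gets NA for the "to" point and therefore NA for the distance.
dt[, dist := dtHaversine(Lat, Lon,
                         shift(Lat, type = "lead"),
                         shift(Lon, type = "lead")),
   by = Bus]

dt
```

Because the call is now grouped, this may be slower than the two-step version on data with many small groups, so it is worth timing both on the real data.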

【Discussion】:

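If I have a large dataset, won't creating two new columns be inefficient? My actual dataset has 1.5 million rows.
@MonikaP It should be quick; 1.5m isn't that many rows. There is only one way to find out: try it.
@MonikaP I have added an example on 1 million rows to show you the speed.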

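【Solution 3】:

geodist::geodist also works, and in the benchmark below it is faster than geosphere::distHaversine (median of about 875 versus 1028 microseconds on this six-row example).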

require(data.table)
require(microbenchmark)

d = 
fread( 
'
Lat,Lon,Time,Bus
52.21808,20.96675,2018-04-20 21:27:26,3
52.25882,20.89850,2018-04-20 21:27:23,8
52.24347,21.08460,2018-04-20 21:27:27,1
52.21935,20.97186,2018-04-20 21:28:31,3
52.25808,20.89790,2018-04-20 21:28:32,8
52.24541,21.08522,2018-04-20 21:28:36,1
')

setorder(d, Bus, Time)

microbenchmark(

 d[, dist_geodist := geodist::geodist(cbind(Lat, Lon),
      measure='haversine', sequential = TRUE) , by = Bus]
,
 d[,dist_geosphere := geosphere::distHaversine(cbind(Lon, Lat) ) , by=Bus]   
 )

Unit: microseconds
                          expr      min        lq      mean    median       uq      max neval cld
         geodist::geodist(...)  861.937  868.7585  910.8999  875.4555  920.138 1463.567   100  a 
 geosphere::distHaversine(...) 1005.890 1016.2335 1065.2952 1028.3775 1070.428 1738.151   100   b
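
Note that both calls in the benchmark return one value fewer than the number of rows in each group. A sketch of padded variants, which put NA in the first row of each bus, follows. It assumes the geodist and geosphere packages are installed; the lon/lat column names passed to cbind() are my own choice, so that geodist can identify the coordinates by name.

```r
library(data.table)

d <- data.table(
  Lat  = c(52.21808, 52.25882, 52.24347, 52.21935, 52.25808, 52.24541),
  Lon  = c(20.96675, 20.89850, 21.08460, 20.97186, 20.89790, 21.08522),
  Time = c("2018-04-20 21:27:26", "2018-04-20 21:27:23", "2018-04-20 21:27:27",
           "2018-04-20 21:28:31", "2018-04-20 21:28:32", "2018-04-20 21:28:36"),
  Bus  = c(3L, 8L, 1L, 3L, 8L, 1L)
)
setorder(d, Bus, Time)

# pad = TRUE makes geodist() return n values (with a leading NA) rather
# than n - 1 when sequential = TRUE.
d[, dist_geodist := geodist::geodist(cbind(lon = Lon, lat = Lat),
                                     sequential = TRUE, pad = TRUE,
                                     measure = "haversine"),
  by = Bus]

# distHaversine() on a single matrix returns n - 1 values; pad it by hand.
d[, dist_geosphere := c(NA_real_, geosphere::distHaversine(cbind(Lon, Lat))),
  by = Bus]

d
```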
