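I looked at the airport_distance function and found that it is not vectorized. That is a problem, because on large datasets computing the distances becomes impractically slow. You should probably consider writing a vectorized function. A simple example: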
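vec_dist <- function(df) {
  # Flatten the two columns into one vector: all origins first, then all destinations
  # (this assumes df has exactly two columns of IATA codes)
  air <- unlist(df)
  # Keep only the airports that actually occur in the data
  match1 <- dplyr::filter(airports, IATA %in% unique(air))
  # Row of each code in the reduced lookup table
  point <- match(air, match1$IATA)
  # Coordinates in radians; column 1 = origin, column 2 = destination
  lon <- matrix((match1$Longitude * pi / 180)[point], ncol = 2)
  lat <- matrix((match1$Latitude * pi / 180)[point], ncol = 2)
  radius <- 6373  # Earth radius in km
  dlon <- lon[, 2] - lon[, 1]
  dlat <- lat[, 2] - lat[, 1]
  # Haversine formula, evaluated on whole vectors at once
  a <- sin(dlat / 2)^2 + cos(lat[, 1]) * cos(lat[, 2]) * sin(dlon / 2)^2
  b <- 2 * atan2(sqrt(a), sqrt(1 - a))
  cbind(df, dist = radius * b)
}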
vec_dist(df1)
   orig_station dest_station      dist
1           LAX          SFO  543.3598
2           BUF          MIA 1912.5540
3           ATL          CAE  307.6851
4           DEN          DEN    0.0000
5           ORD          IND  285.6848
6           DEN          DEN    0.0000
7           MEM          MEM    0.0000
8           TYS          TPA  882.3557
9           IAH          IAH    0.0000
10          CID          PDX 2500.2793
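For reference, the arithmetic inside vec_dist is the haversine formula applied to whole vectors. Written out, with φ for latitude and λ for longitude in radians (my notation, not something from the code) and R for radius:

```latex
\begin{aligned}
\Delta\varphi &= \varphi_2 - \varphi_1, \qquad \Delta\lambda = \lambda_2 - \lambda_1 \\
a &= \sin^2\!\left(\tfrac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\tfrac{\Delta\lambda}{2}\right) \\
b &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\, b
\end{aligned}
```

Here subscript 1 is the origin and subscript 2 the destination, matching columns 1 and 2 of the lat and lon matrices; a and b are the same intermediate values as in the function.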
Why should you consider writing your own function? A quick benchmark gives you an idea:
microbenchmark::microbenchmark(
  vec_dist(df1),
  unlist_Map = unlist(Map(airport_distance, df1$orig_station, df1$dest_station)),
  apply_ = apply(df1[c('orig_station', 'dest_station')], 1,
                 function(x) airport_distance(x[1], x[2])),
  vectorize = Vectorize(airport_distance)(df1$orig_station, df1$dest_station),
  times = 2
)
Unit: milliseconds
          expr        min         lq       mean     median         uq        max neval
 vec_dist(df1)   3.176101   3.176101   3.536051   3.536051   3.896001   3.896001     2
    unlist_Map 431.611700 431.611700 498.710251 498.710251 565.808801 565.808801     2
        apply_ 572.807201 572.807201 577.864401 577.864401 582.921601 582.921601     2
     vectorize 483.825801 483.825801 528.993851 528.993851 574.161900 574.161900     2
However, that was run on data with only 10 rows. What happens when the data grows, with largely the same points repeated?
df1 <- df1[rep(1:10, each=100), ]
Unit: milliseconds
          expr          min           lq         mean       median         uq        max neval
 vec_dist(df1)     7.084901     7.084901     8.564601     8.564601    10.0443    10.0443     2
    unlist_Map 45161.593601 45161.593601 45229.421051 45229.421051 45297.2485 45297.2485     2
        apply_ 45536.644800 45536.644800 53869.454001 53869.454001 62202.2632 62202.2632     2
     vectorize 45286.505601 45286.505601 51775.855502 51775.855502 58265.2054 58265.2054     2
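If you want to play with the idea without the airports table, here is a minimal self-contained sketch in base R. Note that airports_demo, vec_dist_demo and routes are names I made up for this illustration, and the coordinates are rounded approximations typed in by hand, so treat the distances as rough. It uses the same haversine arithmetic as vec_dist, but takes the two columns as separate vectors instead of relying on unlist.

```r
# Hypothetical lookup table; coordinates are approximate, for illustration only
airports_demo <- data.frame(
  IATA      = c("LAX", "SFO", "DEN"),
  Latitude  = c(33.94, 37.62, 39.86),
  Longitude = c(-118.41, -122.37, -104.67),
  stringsAsFactors = FALSE
)

# Vectorized haversine distance in km between two vectors of IATA codes
vec_dist_demo <- function(orig, dest, lookup = airports_demo, radius = 6373) {
  i <- match(orig, lookup$IATA)   # row of each origin (NA if the code is unknown)
  j <- match(dest, lookup$IATA)   # row of each destination
  lat1 <- lookup$Latitude[i] * pi / 180
  lat2 <- lookup$Latitude[j] * pi / 180
  dlat <- lat2 - lat1
  dlon <- (lookup$Longitude[j] - lookup$Longitude[i]) * pi / 180
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  radius * 2 * atan2(sqrt(a), sqrt(1 - a))
}

routes <- data.frame(
  orig_station = c("LAX", "DEN", "XXX"),   # "XXX" is deliberately not in the table
  dest_station = c("SFO", "DEN", "SFO"),
  stringsAsFactors = FALSE
)
routes$dist <- vec_dist_demo(routes$orig_station, routes$dest_station)
print(routes)
```

Codes that are missing from the lookup table simply come back as NA rather than raising an error, which is usually what you want when processing many rows in one call.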