【问题标题】:Quickest way to find closest elements in an array in R在R中查找数组中最近元素的最快方法
【发布时间】:2019-04-12 21:17:39
【问题描述】:

我想在 R 中找到最快的方法来识别 Ytimes 数组中最接近给定 Xtimes 值的元素的索引。

到目前为止,我一直在使用一个简单的 for 循环,但一定有更好的方法来做到这一点:

Xtimes <- c(1,5,8,10,15,19,23,34,45,51,55,57,78,120)
Ytimes <- seq(0,120,length.out = 1000)

YmatchIndex = array(0,length(Xtimes))
for (i in 1:length(Xtimes)) {
  YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
}

print(Ytimes[YmatchIndex])

【问题讨论】:

    标签: arrays r search match


    【解决方案1】:

    强制性 Rcpp 解决方案。利用向量已排序且不包含重复项这一事实将O(n^2) 转换为O(n)。可能对您的应用程序实用,也可能不实用;)

    C++:

    #include <Rcpp.h>
    #include <cmath>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    IntegerVector closest_pts(NumericVector Xtimes, NumericVector Ytimes) {
      int xsize = Xtimes.size();
      int ysize = Ytimes.size();
      int y_ind = 0;
      double minval = R_PosInf;
      IntegerVector output(xsize);
      for(int x_ind = 0; x_ind < xsize; x_ind++) {
        while(std::abs(Ytimes[y_ind] - Xtimes[x_ind]) < minval) {
          minval = std::abs(Ytimes[y_ind] - Xtimes[x_ind]);
          y_ind++;
        }
        output[x_ind] = y_ind;
        minval = R_PosInf;
      }
      return output;
    }
    

    R:

    microbenchmark::microbenchmark(
      for_loop = {
        for (i in 1:length(Xtimes)) {
          which.min(abs(Ytimes - Xtimes[i]))
        }
      },
      apply    = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
      fndIntvl = {
        Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
        Ytimes[ findInterval(Xtimes, Y2) ]
      },
      rcpp = closest_pts(Xtimes, Ytimes),
      times = 100
    )
    
    Unit: microseconds
         expr      min      lq     mean   median       uq      max neval cld
     for_loop 3321.840 3422.51 3584.452 3492.308 3624.748 10458.52   100   b
        apply   68.365   73.04  106.909   84.406   93.097  2345.26   100  a 
     fndIntvl   31.623   37.09   50.168   42.019   64.595   105.14   100  a 
         rcpp    2.431    3.37    5.647    4.301    8.259    10.76   100  a 
    
    identical(closest_pts(Xtimes, Ytimes), findInterval(Xtimes, Y2))
    # TRUE
    

    【讨论】:

      【解决方案2】:

      R 是矢量化的,因此请跳过for 循环。这节省了编写脚本和计算的时间。只需将 for 循环替换为 apply 函数即可。因为我们要返回一个一维向量,所以我们使用sapply

      YmatchIndex &lt;- sapply(Xtimes, function(x){which.min(abs(Ytimes - x))})


      证明apply 更快:

      library(microbenchmark)
      library(ggplot2)
      
      # set up data
      Xtimes <- c(1,5,8,10,15,19,23,34,45,51,55,57,78,120)
      Ytimes <- seq(0,120,length.out = 1000)
      
      # time it
      mbm <- microbenchmark(
        for_loop = for (i in 1:length(Xtimes)) {
          YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
        },
        apply    = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
        times = 100
      )
      
      # plot
      autoplot(mbm)
      

      ?apply for more

      【讨论】:

      • (但不一定是... :-)
      【解决方案3】:

      我们可以使用findInterval 来有效地执行此操作。 (cut 也可以,但需要做更多工作)。

      首先,让我们偏移Ytimes 偏移量,以便我们可以找到最近的,而不是下一个较小的。我将首先演示假数据:

      y <- c(1,3,5,10,20)
      y2 <- c(-Inf, y + c(diff(y)/2, Inf))
      cbind(y, y2[-1])
      #       y     
      # [1,]  1  2.0
      # [2,]  3  4.0
      # [3,]  5  7.5
      # [4,] 10 15.0
      # [5,] 20  Inf
      findInterval(c(1, 1.9, 2.1, 8), y2)
      # [1] 1 1 2 4
      

      第二列(前面带有-Inf 将给我们休息。请注意,每列都位于相应值与其追随者之间。

      好的,让我们将其应用于您的向量:

      Y2 <- Ytimes + c(diff(Ytimes)/2, Inf)
      head(cbind(Ytimes, Y2))
      #         Ytimes         Y2
      # [1,] 0.0000000 0.06006006
      # [2,] 0.1201201 0.18018018
      # [3,] 0.2402402 0.30030030
      # [4,] 0.3603604 0.42042042
      # [5,] 0.4804805 0.54054054
      # [6,] 0.6006006 0.66066066
      
      Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
      cbind(Xtimes, Y2[ findInterval(Xtimes, Y2) ])
      #       Xtimes            
      #  [1,]      1   0.9009009
      #  [2,]      5   4.9849850
      #  [3,]      8   7.9879880
      #  [4,]     10   9.9099099
      #  [5,]     15  14.9549550
      #  [6,]     19  18.9189189
      #  [7,]     23  22.8828829
      #  [8,]     34  33.9339339
      #  [9,]     45  44.9849850
      # [10,]     51  50.9909910
      # [11,]     55  54.9549550
      # [12,]     57  56.9969970
      # [13,]     78  77.8978979
      # [14,]    120 119.9399399
      

      (我使用cbind 只是为了并排演示,并不是必须的。)

      基准测试:

      mbm <- microbenchmark::microbenchmark(
        for_loop = {
          YmatchIndex <- array(0,length(Xtimes))
          for (i in 1:length(Xtimes)) {
            YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
          }
        },
        apply    = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
        fndIntvl = {
          Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
          Ytimes[ findInterval(Xtimes, Y2) ]
        },
        times = 100
      )
      mbm
      # Unit: microseconds
      #      expr    min     lq     mean  median      uq    max neval
      #  for_loop 2210.5 2346.8 2823.678 2444.80 3029.45 7800.7   100
      #     apply   48.8   58.7  100.455   65.55   91.50 2568.7   100
      #  fndIntvl   18.3   23.4   34.059   29.80   40.30   83.4   100
      ggplot2::autoplot(mbm)
      

      【讨论】:

      • 哇,好东西,我不知道链接 findInterval 显然存在
      • 顺便说一句,Ytimes[ findInterval(Xtimes, Ytimes) ] 也可以,不知道你需要什么 Y2 - 没有它会更快
      • @ohemjeh,你说你想要最近的。使用Yimes 将返回间隔,但不一定是最近的。 (1:3)[findInterval(1.9, 1:3)] 应该返回 2,但它会返回 1,因为 1.9 介于 1 和 2 之间(不是最接近 2)。如果这是一个足够大的差异,就交给你了。
      猜你喜欢
      • 2019-07-07
      • 1970-01-01
      • 2013-12-26
      • 2011-12-09
      • 2012-05-07
      • 1970-01-01
      • 2022-01-05
      • 1970-01-01
      • 2017-03-15
      相关资源
      最近更新 更多