How to increase performance of loop operation (answers)

【Question title】: How to increase performance of loop operation
【Posted】: 2019-10-29 21:28:47
【Question description】:

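I have a performance problem with my R code, which checks my data frame for certain combinations of conditions. For each row in my data frame, I need every combination in which that row's variable "A" is greater than or equal to the variable "B" of any other row. In the end I need a 3-column matrix containing all combinations:

Column 1: row number of variable A
Column 2: row number for which variable B is less than or equal to A
Column 3: -1

I need to check this for every row. Maybe my problem becomes clearer when you see my code.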

Z <-data.frame(index1=NA,index2=NA,index3=NA)

for(i in 1:nrow(my.data)){

  interim_result <- my.data[i,"A"] >= my.data$B
  if(sum(is.na(interim_result))!=length(interim_result)){

    Y <- rbind(rep(i, sum(interim_result*1)), which(interim_result == TRUE), rep(-1, sum(interim_result)))
    print(i)
    Y <- t(Y)
    colnames(Y) <- c("index1","index2","index3")
    Z <- rbind(Z,Y)
  }
}

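I have checked my code and it works fine, but it is far too slow. My data frame has about 350K rows, and the computation takes a very long time. Does anyone know how I can speed it up?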

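【Comments】:

You are growing an object inside a for loop, which is very inefficient and not recommended in R. Have a look at these great posts for better approaches: Efficient accumulation in R, Applying a function over rows of a data frame, and Row-oriented workflows in R with the tidyverse.
What are you going to do with the matrix?

Tags: r performance loops runtime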

【Solution 1】:

Use outer() and which().

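set.seed(1)
n_rows <- 10
my.data <- data.frame(A = rnorm(n_rows), B = rnorm(n_rows))

# outer() builds the full logical matrix: element [i, j] is A[i] >= B[j]
mat <- which(outer(my.data[['A']], my.data[['B']], '>='), arr.ind = TRUE)
# which() returns (row, col) pairs: row is the position in A, col the position in B
colnames(mat) <- c('index2', 'index1')

mat[, c('index1', 'index2')]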

      index1 index2
 [1,]      1      4
 [2,]      2      4
 [3,]      2      7
 [4,]      2      8
 [5,]      2      9
 [6,]      3      2
 [7,]      3      4
 [8,]      3      5
 [9,]      3      7
... a total of 39 rows

I left out index3 because it is a constant. If it is always -1, it does not carry much information.
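In the snippet above, the rows of the logical matrix index A and the columns index B, so index1 ends up holding the row that supplies B. The following is a minimal sketch, assuming the question's convention is wanted instead (index1 = row supplying A, index2 = row supplying B). The names cmp and pairs are illustrative, and the ordering step is only there to make the result comparable with the output order of the loop in the question.

```r
set.seed(1)
n_rows <- 10
my.data <- data.frame(A = rnorm(n_rows), B = rnorm(n_rows))

# cmp[i, j] is TRUE when A[i] >= B[j]
cmp <- outer(my.data[["A"]], my.data[["B"]], ">=")

# which(..., arr.ind = TRUE) returns one (row, col) pair per TRUE cell:
# row is the position in A, col is the position in B
pairs <- which(cmp, arr.ind = TRUE)
colnames(pairs) <- c("index1", "index2")

# Order by A row, then B row, like the loop in the question
pairs <- pairs[order(pairs[, "index1"], pairs[, "index2"]), , drop = FALSE]

# Append the constant third column
pairs <- cbind(pairs, index3 = -1)
head(pairs, 3)
```

Because the full logical matrix is still built by outer(), this variant has the same memory footprint as the original snippet.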

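By adding an ID to the original data.frame and using lapply, I was able to get a big speed-up while still using a loop. This lets me skip the which() call and means I do not have to worry about preallocating Z.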

my.data$ID <- seq_len(nrow(my.data))
do.call(rbind,
        lapply(seq_len(nrow(my.data)), function(i) {
          # IDs of all rows whose B is <= the A of row i
          interim_result <- my.data[['ID']][my.data[i, "A"] >= my.data[['B']]]
          # rows without any match return NULL, which rbind() ignores
          if (length(interim_result) != 0) {
            cbind(index1 = i, index2 = interim_result, index3 = -1)
          }
        }))
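A possible middle ground between the row-by-row lapply above and a single outer() call over all rows is to apply outer() to blocks of rows, so the temporary logical matrix never exceeds chunk_size x length(B) cells. This is a minimal sketch and not part of the original answer or its benchmarks: chunked_pairs and chunk_size are hypothetical names, and the function assumes A is non-empty.

```r
set.seed(1)
n_rows <- 10
my.data <- data.frame(A = rnorm(n_rows), B = rnorm(n_rows))

# Hypothetical helper: compare A against B one block of rows at a time
chunked_pairs <- function(A, B, chunk_size = 1000L) {
  starts <- seq(1L, length(A), by = chunk_size)   # assumes length(A) >= 1
  parts <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1L, length(A))  # rows of A in this block
    hit <- which(outer(A[idx], B, ">="), arr.ind = TRUE)
    # Skip blocks without matches; rbind() ignores NULL entries
    if (nrow(hit) == 0L) return(NULL)
    # hit[, 1] is relative to the block, so map it back through idx
    cbind(index1 = idx[hit[, 1]], index2 = hit[, 2], index3 = -1)
  })
  do.call(rbind, parts)
}

res <- chunked_pairs(my.data[["A"]], my.data[["B"]], chunk_size = 3L)
nrow(res)
```

The result itself can still be very large, so this only bounds the intermediate matrix, not the final output.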

Finally, if you are into data.table, you can use a non-equi join.

library(data.table)

dt <- as.data.table(my.data)
dt[, ID := seq_len(.N)]

# Non-equi self join: keeps pairs where x.A >= i.B
dt[dt,
   on = .(A >= B),
   .(index1 = i.ID, index2 = ID, index3 = -1),
   allow.cartesian = TRUE]

Performance with a 10-row data.frame:

Unit: microseconds
            expr     min         lq       mean     median         uq       max neval
   original_loop 12607.5 12687.6510 13420.3960 12843.0520 13260.4010 17939.301    20
      optim_loop   412.5   439.4515   695.5263   451.2510   462.0020  5345.802    20
          dt_way  3053.0  3140.7510  3269.0610  3268.9010  3351.2010  3667.601    20
 outer_statement    48.5    53.9005    65.7108    70.6505    72.7515    75.701    20

100-row data.frame:

Unit: microseconds
            expr       min        lq      mean     median        uq       max neval
   original_loop 42241.600 43560.001 48111.291 46051.7515 48297.301 79910.601    20
      optim_loop  3888.601  4010.551  4775.211  4107.6010  4299.400  9010.601    20
          dt_way  3356.902  3595.601  3857.906  3752.8505  3966.701  5330.101    20
 outer_statement   304.901   312.401   344.661   332.5005   348.701   473.000    20

1,000 rows (original loop dropped):

Unit: milliseconds
            expr     min       lq     mean   median       uq     max neval
      optim_loop 55.0290 58.18355 60.50015 60.08140 62.47300 66.6332    20
          dt_way 29.1114 29.66050 32.19182 30.00790 30.88125 45.7993    20
 outer_statement 24.2323 24.44935 26.87686 24.64055 27.48775 35.9967    20

10,000 rows:

Unit: seconds
            expr      min       lq     mean   median       uq      max neval
      optim_loop 2.233144 2.277568 2.401055 2.382523 2.496764 2.615275     5
          dt_way 3.622701 3.638953 3.660230 3.639226 3.649577 3.750691     5
 outer_statement 3.250272 3.353263 3.369732 3.375544 3.409773 3.459810     5

My computer crashed after that. To my surprise, the optimized loop started to gain ground at this size.
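One way to see why larger inputs run out of memory is to count the matching pairs before materialising them. The sketch below is an illustration rather than part of the original benchmark. It assumes A and B contain no NA values, and the names b_sorted, pairs_per_row, n_pairs and n_big are made up for the example. It relies on findInterval(), which for a sorted vector returns how many of its values are less than or equal to each input.

```r
set.seed(1)
n_rows <- 10
my.data <- data.frame(A = rnorm(n_rows), B = rnorm(n_rows))

# For each A, count how many B values are <= A (no n x n matrix needed)
b_sorted <- sort(my.data[["B"]])
pairs_per_row <- findInterval(my.data[["A"]], b_sorted)

# as.numeric() avoids integer overflow when the total gets large
n_pairs <- sum(as.numeric(pairs_per_row))
n_pairs

# outer() materialises an n x n logical matrix at 4 bytes per cell;
# size in GiB for the 350K rows mentioned in the question
n_big <- 350000
n_big^2 * 4 / 1024^3
```

If n_pairs is already in the billions, no amount of loop tuning will make the full 3-column result fit in memory, which ties in with the comment asking what the matrix will be used for.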

【Comments】: