Alternative to complex for loop to improve performance

【Question title】: Alternative to complex for loop to improve performance
【Posted】: 2016-04-16 20:56:35
【Question】:

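The following code takes a 20x1 data frame and, for each row, checks whether any of the next 6 rows (i.e. rows i+1 to i+6) exceeds the row 3 below it by more than 2 points (e.g. row i+1 minus row i+4 > 2). If so, a 1 is recorded in a newly created Signal column.

For example, for the first row it checks whether:

Row 2 > Row 5 + 2, or
Row 3 > Row 6 + 2, or

...

Row 7 > Row 10 + 2

If possible, I would like to find an alternative to the for loop. I run this template code on a large database, where the loop can take several hours. Note that the loop code is somewhat convoluted in order to keep it from running out of bounds. Many thanks to @Gregor for his great help in putting this together.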

#Data
df <- data.frame(Price = c( 1221, 1220, 1220, 1217, 1216,  1218 , 1216, 1216, 1217, 1220, 1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205))

#Inputs
Window = 6                # check up to this far below current row
IndexDifference = 3       # check row against another this far down
ValueDifference = 2       # for difference at least this big

#Define loop boundaries 
base_rows = 1:(nrow(df) - IndexDifference)  # can't check more than this
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference) # for a given base row, this is the maximum row to start checking against

#Make Signal variable
df$Signal = rep(NA, nrow(df)) #pre-allocate variable
for (i in seq_along(base_rows)) {
  df$Signal[i] = as.numeric(
    any(
      df$Price[(i + 1):candidate_max[i]] - 
        df$Price[((i + 1):candidate_max[i]) + IndexDifference] > ValueDifference))}

【Comments】:

If you have a "Signal" and a "Price" vector, I guess it would be more efficient to avoid [.data.frame and [<-.data.frame. E.g. compare assigning to a single-column "data.frame" versus a plain vector: x1 = data.frame(col1 = integer(1e5)); x2 = integer(1e5); system.time( for(i in seq_len(nrow(x1))) x1$col1[i] = 1L ); system.time( for(i in seq_along(x2)) x2[i] = 1L )
Thanks, Alexis. I have dozens of columns; this is a simplified example.

Tags: r performance loops for-loop

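【Answer 1】:

This is a bit late, but just in case.

I agree with @alexis_laz that more comparisons are calculated than necessary. I think the idea can be taken further, though, because applying any in a rolling fashion also does unnecessary work.

The key is that a given row is always compared against one other specific row (the one 3 below it, in your example). Once we know whether the condition holds for that row, every other row that contains it within its window should be given the value 1 (TRUE).

The useful shortcut is this: if the condition holds for row j, making row i TRUE, and row j is also inside the window of row i+1, then row i+1 is TRUE as well (without needing to know the state of any other point in the window). In other words, we do not need to evaluate any over each row's window from scratch. If we know how many TRUEs are in row i's window, then for row i+1 we only need to determine whether the point leaving the window is TRUE and whether the point entering it is TRUE. Essentially, we filter the vector with a box of width Window and then check which entries have at least one TRUE in their window (this could be done in a single pass, but let's ignore that since the extra time is insignificant).

Using a rolling sum, we can compute this efficiently by keeping a running count and adding or removing points as they enter or leave the window. This is where @alexis_laz's observation comes in: the state of the points entering and leaving can be precomputed.

To make this more concrete, here is some code. First, let me copy your original loop, @Richard Telford's answer and @alexis_laz's answer and wrap them in functions (slightly rewritten, mostly for my own convenience, so that the output formats match; hopefully without introducing any bugs):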

f_G <- function(x, window, idiff, valdiff){
  base_rows = 1:(NROW(x) - idiff - 1)  # can't check more than this
  candidate_max = pmin(base_rows + window, NROW(x) - idiff) # maximum row to start checking against
  out = rep(0, NROW(x)) #pre-allocate variable
  for (i in seq_along(base_rows)) {
    out[i] = as.numeric(any(x[(i + 1):candidate_max[i]]
           - x[((i + 1):candidate_max[i]) + idiff] > valdiff))}
  return(out)
}

f_RT <- function(x, window, idiff, valdiff){
  x0 <- cbind(x[-(1)][1:NROW(x)], sapply(2:window,
                                        function(i)x[-(1:i)][1:NROW(x)]))
  x1 <- sapply((idiff+1):(idiff+window),
              function(i)x[-(1:i)][1:NROW(x)])
  out <- as.numeric(apply((x0 - x1) > valdiff, 1, any, na.rm = TRUE))
  return(out)
}

f_AL <- function(x, window, idiff, valdiff){
  check = (x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff
  check <- c(check, rep(FALSE, idiff+1))
  out <- as.integer(sapply(seq_along(check),
                      function(i) any(check[i:min(length(check), (i + (window - 1)))])))
  return(out)
}

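Then here are two functions that compute the rolling sum described above on a vector of precomputed differences (as @alexis_laz suggested). The first uses the filter function (stats::filter in base R), while the second uses roll_sum from the RcppRoll package.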

f_filt <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## reverse series due to filter using "past" values
  ffilt <- rev(stats::filter(rev(check), rep(1, window), sides=1))
  ## check if at least one
  out <- ifelse(na.omit(ffilt) > 0, 1, 0)
  return(out)
}

library(RcppRoll)
f_roll <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## rolling window sum
  froll <- roll_sum(check, n=window, align="right")
  out <- ifelse(froll > 0, 1, 0)
  return(out)
}
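
For readers who want to see the running count itself rather than a library call, the idea described earlier can be written out by hand. This is a minimal sketch, not part of the benchmarked code: rolling_any_count is a hypothetical helper name, and it assumes flags contains no NA values.

```r
# Hypothetical helper: rolling "any" over a logical vector via a running count.
# out[i] is 1 if any of flags[i], ..., flags[i + window - 1] is TRUE;
# positions past the end of the vector count as FALSE.
rolling_any_count <- function(flags, window) {
  n <- length(flags)
  flags <- as.integer(flags)
  out <- integer(n)
  if (n == 0L) return(out)
  # number of TRUEs in the first window
  count <- sum(flags[seq_len(min(window, n))])
  for (i in seq_len(n)) {
    out[i] <- as.integer(count > 0L)
    # slide the window: drop the point leaving it, add the point entering it
    count <- count - flags[i]
    entering <- i + window
    if (entering <= n) count <- count + flags[entering]
  }
  out
}

flags <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
rolling_any_count(flags, 3)
# 1 0 0 1 1 1 0
```

Each step touches only two elements, so the work per row does not depend on the window width. An interpreted R loop like this is still likely to be slower than filter or roll_sum; it is meant to show the bookkeeping, not to replace them.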

As a quick check, we can test that all the functions give the same answer:

f_G(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_RT(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_AL(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_filt(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_roll(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

Now let's benchmark them. I'll also increase the number of rows to test on.

library(microbenchmark)
w <- Window
idiff <- IndexDifference
vdiff <- ValueDifference

df2 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df2, w, idiff, vdiff),
               f_RT(df2, w, idiff, vdiff),
               f_AL(df2, w, idiff, vdiff),
               f_filt(df2, w, idiff, vdiff),
               f_roll(df2, w, idiff, vdiff)
               )
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval   cld
    f_G(df2, w, idiff, vdiff) 395.80227 412.05120 419.88554 413.55551 417.84907 479.47306   100     e
   f_RT(df2, w, idiff, vdiff) 154.43919 192.99473 193.10029 195.61031 197.95933 236.27244   100   c  
   f_AL(df2, w, idiff, vdiff) 233.30237 244.01664 249.75449 245.07001 248.51249 319.04956   100    d 
 f_filt(df2, w, idiff, vdiff)  21.53997  22.51582  25.38218  22.59477  23.56873  63.48320   100  b   
 f_roll(df2, w, idiff, vdiff)  14.26333  14.35543  16.99302  15.24879  15.45127  55.49886   100 a

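In the end, we get a rather nice speed-up by doing it this way (a median of about 15 ms for f_roll versus about 414 ms for the original loop). Another neat thing about this approach is that it stays just as efficient regardless of the window size (in particular when doing the rolling sum directly; using filter does slow down a bit, though it is still fast).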

w <- 25 #Window
df3 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df3, w, idiff, vdiff),
               f_RT(df3, w, idiff, vdiff),
               f_AL(df3, w, idiff, vdiff),
               f_filt(df3, w, idiff, vdiff),
               f_roll(df3, w, idiff, vdiff)
               )
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval   cld
    f_G(df3, w, idiff, vdiff) 487.65798 516.67700 537.54019 541.34459 551.52128 592.05720   100     e
   f_RT(df3, w, idiff, vdiff) 328.44934 366.76176 389.08534 401.39053 409.49376 518.94535   100    d 
   f_AL(df3, w, idiff, vdiff) 240.99006 258.66045 263.21317 260.09258 263.75917 319.02493   100   c  
 f_filt(df3, w, idiff, vdiff)  37.32291  37.41098  38.97167  37.47234  38.40989  79.51684   100  b   
 f_roll(df3, w, idiff, vdiff)  15.49264  15.52950  15.86283  15.55252  15.62852  19.77415   100 a
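
One way to see why the cost of a rolling sum need not depend on the window width is to express it through a cumulative sum, which needs two lookups per row whatever the window is. The sketch below is base R only and was not part of the benchmark above; rolling_any_cumsum is a hypothetical name, and the data and parameters are copied from the question.

```r
# Hypothetical helper: rolling "any" via a cumulative sum (assumes no NA in flags).
rolling_any_cumsum <- function(flags, window) {
  n <- length(flags)
  idx <- seq_len(n)
  # cs[k + 1] holds the number of TRUEs in flags[1:k]
  cs <- c(0L, cumsum(as.integer(flags)))
  # last index of each window, clipped to the end of the vector
  hi <- pmin(idx + window - 1L, n)
  # TRUEs inside flags[i:hi[i]] = cs[hi[i] + 1] - cs[i]
  as.integer(cs[hi + 1L] - cs[idx] > 0L)
}

price <- c(1221, 1220, 1220, 1217, 1216, 1218, 1216, 1216, 1217, 1220,
           1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205)
n <- length(price)
# precomputed comparisons: row k+1 against row k+4, threshold 2
check <- (price[2:(n - 3)] - price[(2 + 3):n]) > 2
rolling_any_cumsum(check, 6)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
```

The 16 values agree with the first 16 entries printed by the functions above; the remaining four rows cannot be evaluated and are left out here.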

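【Comments】:

【Answer 2】:

One solution to this problem is to build two matrices of lagged columns and subtract one from the other. This takes advantage of vectorization in R and should be fast.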

df0 <- cbind(df$Price[-(1)][1:nrow(df)], sapply(2:Window, function(i)df$Price[-(1:i)][1:nrow(df)]))
df1 <- sapply((IndexDifference+1):(IndexDifference+Window), function(i)df$Price[-(1:i)][1:nrow(df)])  
df$Signal <- as.numeric(apply((df0 - df1) > ValueDifference, 1, any, na.rm = TRUE))
df$Signal
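
The shifting trick used to build those columns may be easier to follow on a toy vector. This is only an illustration with made-up values; x, lead1 and lead3 are hypothetical names.

```r
x <- c(10, 20, 30, 40, 50)

# x[-(1:i)] drops the first i elements; indexing the result with 1:length(x)
# pads the tail with NA so every shifted column keeps the original length
lead1 <- x[-1][1:length(x)]        # 20 30 40 50 NA
lead3 <- x[-(1:3)][1:length(x)]    # 40 50 NA NA NA

# element k is x[k + 1] - x[k + 3]; NA wherever a row falls off the end
lead1 - lead3                      # -20 -20 NA NA NA
```

The NA entries are why the apply call above passes na.rm = TRUE to any.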

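Note that this does not give exactly the same result as your code, probably because with

i = 17
(i + 1):candidate_max[i]

the index evaluates to c(18, 17), which is probably not what you intended.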
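
The behaviour comes from the colon operator counting downwards when its left operand is larger than its right one. A small sketch, with the value 17 for candidate_max[17] worked out from the question's inputs; upper, bad_idx and safe_idx are hypothetical names, and the seq_len form is one possible guard, not something from the original code.

```r
i <- 17
upper <- 17   # candidate_max[17] = pmin(17 + 6, 20 - 3) for the 20-row example

# the colon operator counts down instead of returning an empty sequence
bad_idx <- (i + 1):upper
bad_idx    # 18 17

# one possible guard: seq_len() yields an empty index when nothing is left
safe_idx <- i + seq_len(max(0, upper - i))
length(safe_idx)   # 0
```

With an empty index, any() over the comparison returns FALSE instead of looking at rows outside the intended range.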

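【Comments】:

Running this on thousands of rows, losing one data point at the end is not a problem. This is perfect, thank you very much.

【Answer 3】:

In your loop, most Price[i] - Price[i + IndexDifference] > ValueDifference comparisons are calculated multiple times; in this case (code at the end), most comparisons are made 6 times: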

#    [i]  [i + IndexDifference]  [times calculated]   
#    Var1 Var2 Freq
#70     2    5    1
#88     3    6    2
#106    4    7    3
#124    5    8    4
#142    6    9    5
#160    7   10    6
#178    8   11    6
#196    9   12    6
#214   10   13    6
#232   11   14    6
#250   12   15    6
#268   13   16    6
#286   14   17    6
#304   15   18    6
#322   16   19    6
#340   17   20    6

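Also, I guess the cost comes not only from the repeated computation itself but also from repeatedly assigning into (and subsetting) the "data.frame".

Instead, you could calculate the differences and the comparisons once: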

tmp = (df$Price[2:(nrow(df) - IndexDifference)] - 
      df$Price[(2 + IndexDifference):nrow(df)]) > ValueDifference

and apply any in a rolling fashion (taking into account your remark about not going out of bounds):

as.integer(sapply(seq_along(tmp), 
                  function(i) any(tmp[i:min(length(tmp), (i + (Window - 1)))])))
#[1] 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
                #and 4 values are left (rows 17:20 that cannot be 
                #calculated based on the conditions) to be added as `NA`
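
Putting the pieces together with the earlier point about avoiding repeated assignment into the data.frame, a possible end-to-end sketch is to work on a plain vector and write the Signal column once. It uses the question's data and inputs; price, signal and n are hypothetical names, and vapply replaces sapply only to pin the result type.

```r
df <- data.frame(Price = c(1221, 1220, 1220, 1217, 1216, 1218, 1216, 1216, 1217, 1220,
                           1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205))
Window <- 6
IndexDifference <- 3
ValueDifference <- 2

n <- nrow(df)
price <- df$Price   # extract the column once and work on a plain vector

# each comparison is computed exactly once
tmp <- (price[2:(n - IndexDifference)] -
          price[(2 + IndexDifference):n]) > ValueDifference

# rolling any() over the precomputed comparisons
signal <- vapply(seq_along(tmp),
                 function(i) any(tmp[i:min(length(tmp), i + Window - 1)]),
                 logical(1))

# pad the rows that cannot be evaluated with NA and assign the column once
df$Signal <- c(as.integer(signal), rep(NA_integer_, n - length(signal)))
df$Signal
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 NA NA NA NA
```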

The tabulation of the comparisons:

#re-calculated your 'base_rows' to not include row 17 as it exceeds the 'IndexDifference'
base_rows = 1:(nrow(df) - IndexDifference - 1L)  
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference) 

#set-up the tabulations for each comparison     
table_diffs = matrix(0L, 
                     base_rows[length(base_rows)] + 1L,
                     candidate_max[length(candidate_max)] + IndexDifference)
for(i in seq_along(base_rows)) { 
    ij = cbind((i + 1):candidate_max[i], ((i + 1):candidate_max[i]) + IndexDifference)
    table_diffs[ij] = table_diffs[ij] + 1L
}   
#format
subset(transform(as.data.frame(as.table(table_diffs)), 
                 Var1 = as.integer(Var1), 
                 Var2 = as.integer(Var2)), 
       Freq != 0L)

【Comments】:

Thanks a lot, Alexis! I selected @Richard Telford's answer as the accepted answer because both give the same result but his is slightly faster (both tested).