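This is a bit late, but just in case.
I agree with @alexis_laz that more comparisons are being computed than necessary. I think the idea can be taken a step further, though, because applying any() in a rolling fashion also does unnecessary work.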
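The key point is that a given row is always compared against one other specific row (in your example, the one 3 rows further down). Once we know whether the comparison holds for that row, every other row that contains it within its window should be given the value 1 (TRUE).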
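As a minimal sketch of that "compare once" step (the prices, idiff and valdiff below are made-up values for illustration, not the data from the question), the comparison can be evaluated for the whole vector in a single vectorised call:

```r
x <- c(10, 9, 12, 5, 4, 3, 8, 2)  # hypothetical prices
idiff <- 3    # compare each row with the row idiff positions below it
valdiff <- 4  # threshold on the difference
n <- length(x)

# One comparison per row, computed once for the whole vector.
# Indexing follows the original loop: entry j refers to row j + 1.
check <- (x[2:(n - idiff)] - x[(2 + idiff):n]) > valdiff
print(check)
# TRUE TRUE FALSE FALSE
```

Everything that follows only needs this logical vector, not the original prices.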
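The useful shortcut here is that if the comparison holds for row j and this makes row i TRUE, and row j is also within the window of row i+1, then row i+1 is TRUE as well (without needing to know the state of the other points in the window). In other words, we don't need to evaluate any() over the window of every row. If we know how many TRUEs are in the window of row i, then for row i+1 we only need to determine whether the point leaving the window is TRUE and whether the point entering it is TRUE. Essentially, we filter the vector with a box filter of width Window and then check which entries have at least one TRUE value in their window (this could be done in a single pass, but let's ignore that since the extra time is negligible).
Using a rolling sum, we can compute this efficiently by keeping a running count and adding/removing the points that enter/leave the window. This is where @alexis_laz's observation comes in: the state of the entering/leaving points can be precomputed.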
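To illustrate the running count, here is a small sketch in plain R. The check vector and the helper names rolling_any_count and rolling_any_naive are made up for this illustration; the explicit loop is only meant to show the bookkeeping, not to be fast.

```r
# Toy flags: TRUE where the precomputed comparison holds (hypothetical data)
check <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
window <- 3  # rows i, i + 1, ..., i + window - 1 form the window of row i

# Rolling "any" via a running count of TRUEs inside the window
rolling_any_count <- function(check, window) {
  n <- length(check)
  padded <- c(as.integer(check), rep(0L, window))  # zeros past the end
  out <- integer(n)
  count <- sum(padded[seq_len(window)])            # count for the first window
  for (i in seq_len(n)) {
    out[i] <- as.integer(count > 0L)
    # slide by one row: drop the point leaving, add the point entering
    count <- count - padded[i] + padded[i + window]
  }
  out
}

# Reference: recompute any() over every window
rolling_any_naive <- function(check, window) {
  n <- length(check)
  vapply(seq_len(n),
         function(i) as.integer(any(check[i:min(n, i + window - 1)])),
         integer(1))
}

res_count <- rolling_any_count(check, window)
res_naive <- rolling_any_naive(check, window)
stopifnot(identical(res_count, res_naive))
print(res_count)
# [1] 1 1 1 0 0 1 1 1 0 0
```

Each step of the loop touches only two entries, whatever the window width, whereas the naive version looks at every entry of every window.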
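To make things more concrete, here is some code. First, let me reproduce your original loop, @Richard Telford's answer and @alexis_laz's answer, wrapped into functions (with slight rewrites, mainly for my own convenience and so that the output formats match; hopefully without introducing any bugs):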
f_G <- function(x, window, idiff, valdiff){
  base_rows = 1:(NROW(x) - idiff - 1) # can't check more than this
  candidate_max = pmin(base_rows + window, NROW(x) - idiff) # maximum row to start checking against
  out = rep(0, NROW(x)) #pre-allocate variable
  for (i in seq_along(base_rows)) {
    out[i] = as.numeric(any(x[(i + 1):candidate_max[i]]
                            - x[((i + 1):candidate_max[i]) + idiff] > valdiff))
  }
  return(out)
}
f_RT <- function(x, window, idiff, valdiff){
  x0 <- cbind(x[-(1)][1:NROW(x)], sapply(2:window,
              function(i)x[-(1:i)][1:NROW(x)]))
  x1 <- sapply((idiff+1):(idiff+window),
               function(i)x[-(1:i)][1:NROW(x)])
  out <- as.numeric(apply((x0 - x1) > valdiff, 1, any, na.rm = TRUE))
  return(out)
}
f_AL <- function(x, window, idiff, valdiff){
  check = (x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff
  check <- c(check, rep(FALSE, idiff+1))
  out <- as.integer(sapply(seq_along(check),
                function(i) any(check[i:min(length(check), (i + (window - 1)))])))
  return(out)
}
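Then here are two functions that compute the rolling sum I described above, on a vector of precomputed differences (as suggested by @alexis_laz). The first uses the filter() function, while the second uses roll_sum() from the RcppRoll package.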
f_filt <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## reverse series due to filter using "past" values
  ## (stats::filter is called explicitly in case another package, e.g. dplyr, masks filter)
  ffilt <- rev(stats::filter(rev(check), rep(1, window), sides=1))
  ## check if at least one
  out <- ifelse(na.omit(ffilt) > 0, 1, 0)
  return(out)
}
library(RcppRoll)
f_roll <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## rolling window sum
  froll <- roll_sum(check, n=window, align="right")
  out <- ifelse(froll > 0, 1, 0)
  return(out)
}
As a quick check, we can test that all functions give the same answer:
f_G(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_RT(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_AL(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_filt(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_roll(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
Now let's benchmark them. I'll also increase the number of rows to test on.
library(microbenchmark)
w <- Window
idiff <- IndexDifference
vdiff <- ValueDifference
df2 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df2, w, idiff, vdiff),
f_RT(df2, w, idiff, vdiff),
f_AL(df2, w, idiff, vdiff),
f_filt(df2, w, idiff, vdiff),
f_roll(df2, w, idiff, vdiff)
)
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval cld
    f_G(df2, w, idiff, vdiff) 395.80227 412.05120 419.88554 413.55551 417.84907 479.47306   100   e
   f_RT(df2, w, idiff, vdiff) 154.43919 192.99473 193.10029 195.61031 197.95933 236.27244   100   c
   f_AL(df2, w, idiff, vdiff) 233.30237 244.01664 249.75449 245.07001 248.51249 319.04956   100   d
 f_filt(df2, w, idiff, vdiff)  21.53997  22.51582  25.38218  22.59477  23.56873  63.48320   100   b
 f_roll(df2, w, idiff, vdiff)  14.26333  14.35543  16.99302  15.24879  15.45127  55.49886   100   a
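In the end, we get a pretty decent speed-up by doing it this way. Another neat thing about this approach is that it stays about equally efficient regardless of the window size. This holds in particular for the direct rolling sum; filter() does slow down somewhat, although it is still fast.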
w <- 25 #Window
df3 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df3, w, idiff, vdiff),
f_RT(df3, w, idiff, vdiff),
f_AL(df3, w, idiff, vdiff),
f_filt(df3, w, idiff, vdiff),
f_roll(df3, w, idiff, vdiff)
)
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval cld
    f_G(df3, w, idiff, vdiff) 487.65798 516.67700 537.54019 541.34459 551.52128 592.05720   100   e
   f_RT(df3, w, idiff, vdiff) 328.44934 366.76176 389.08534 401.39053 409.49376 518.94535   100   d
   f_AL(df3, w, idiff, vdiff) 240.99006 258.66045 263.21317 260.09258 263.75917 319.02493   100   c
 f_filt(df3, w, idiff, vdiff)  37.32291  37.41098  38.97167  37.47234  38.40989  79.51684   100   b
 f_roll(df3, w, idiff, vdiff)  15.49264  15.52950  15.86283  15.55252  15.62852  19.77415   100   a
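One way to see why the window size matters so little for a rolling sum is a cumulative-sum formulation: every window sum is the difference of two cumulative sums, so the work per row is constant. This is a base-R alternative I am adding purely as an illustration; it is not one of the functions benchmarked above, and the check vector and helper names are made up.

```r
# Toy flags: TRUE where the precomputed comparison holds (hypothetical data)
check <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Rolling "any" from cumulative sums: two lookups per row for any window width
rolling_any_cumsum <- function(check, window) {
  n <- length(check)
  padded <- c(as.integer(check), rep(0L, window - 1L))  # zeros past the end
  cs <- c(0L, cumsum(padded))                           # cs[k + 1] = sum(padded[1:k])
  idx <- seq_len(n)
  # sum over rows i .. i + window - 1 equals cs[i + window] - cs[i]
  as.integer(cs[idx + window] - cs[idx] > 0L)
}

# Reference: recompute any() over every window
rolling_any_naive <- function(check, window) {
  n <- length(check)
  vapply(seq_len(n),
         function(i) as.integer(any(check[i:min(n, i + window - 1)])),
         integer(1))
}

res_w3 <- rolling_any_cumsum(check, 3)
res_w5 <- rolling_any_cumsum(check, 5)
stopifnot(identical(res_w3, rolling_any_naive(check, 3)))
stopifnot(identical(res_w5, rolling_any_naive(check, 5)))
print(res_w3)
# [1] 1 1 1 0 0 1 1 1 0 0
print(res_w5)
# [1] 1 1 1 1 1 1 1 1 0 0
```

I have not benchmarked this variant, so treat it as an explanation of the scaling rather than a replacement for roll_sum().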