高于给定阈值的向量值的快速求和答案

【问题标题】：Fast sum of values of a vector above given thresholds高于给定阈值的向量值的快速求和
【发布时间】：2020-01-14 21:46:55
【问题描述】：

我有一个阈值向量 thresholds 和另一个向量 x。我想创建一个新向量，比如vec_sum，长度与thresholds 相同，它为thresholds 的每个元素存储大于该元素的x 值的总和。

最快的方法是什么？我这样做的天真方式是

vec_sum <- rep(NA,length(thresholds))
for(i in seq_along(thresholds))
{
   vec_sum[i] <- sum(x[x>thresholds[i]])
}

如果有帮助，阈值已经排序。

【问题讨论】：

与您的尝试没有任何不同，但我会使用 sapply 而不是 for loop sapply(thresholds, function(i) sum(x[x>i]))
sorted <- sort(x); sum(x) - cumsum(sorted)[findInterval(thresholds, sorted)]（参考stackoverflow.com/questions/37617125/…）

标签： r performance loops vector

【解决方案1】：

这是另一个使用cumsum的解决方案：

f1 <- function(v, th){
    v2 <- v[order(v)]
    v2s <- rev(cumsum(rev(v2)))
    return(v2s[findInterval(th, v2) + 1])
}

以下是 Ronak 与其他答案（以及示例数据）的一些测试和比较：

f2 <- function(x, thresholds){
    if (all(x < thresholds[1])) return(rep(0, length(thresholds)))
    if (all(x > thresholds[length(thresholds)])) return(rep(sum(x), length(thresholds)))
    return(rev(cumsum(rev(tapply(x, 
        findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
}

test_th <- c(3, 5, 10)
test_x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20)

vec_sum <- rep(NA,length(test_th))
for(i in seq_along(test_th)) {
    vec_sum[i] <- sum(test_x[test_x>test_th[i]])
}

all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] TRUE


set.seed(123)
test_x <- rnorm(10000)
test_th <- sort(rnorm(100)) ## f2 requires sorted threshold values

vec_sum <- rep(NA,length(test_th))
for(i in seq_along(test_th)) {
    vec_sum[i] <- sum(test_x[test_x>test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] FALSE
# Warning message:
# In x - y : longer object length is not a multiple of shorter object length

library(microbenchmark)
microbenchmark(
    a = f1(test_x, test_th),
    b = f2(test_x, test_th)
)
# Unit: microseconds
#  expr      min       lq      mean   median       uq       max neval
#     a  587.116  682.864  900.3572  694.713  703.726 10647.206   100
#     b 1157.213 1203.063 1260.0663 1223.600 1258.552  2143.069   100

【讨论】：

【解决方案2】：

不确定这是否更快，但我们可以使用findInterval 将x 剪切为thresholds。我们使用tapply 取每组的sum 并反过来取cumsum。

as.integer(rev(cumsum(rev(tapply(x, 
          findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))

经过测试

thresholds <- c(3, 5, 10)
x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20) #1:20 in random order
vec_sum <- rep(NA,length(thresholds))

for(i in seq_along(thresholds)) {
  vec_sum[i] <- sum(x[x>thresholds[i]])
}
vec_sum
#[1] 204 195 155

使用建议的解决方案

as.integer(rev(cumsum(rev(tapply(x, 
          findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
#[1] 204 195 155

解释答案。 findInterval 返回 x 的每个值所属的组

findInterval(x, thresholds, left.open = TRUE)
#[1] 0 0 0 3 1 2 1 3 2 2 2 2 3 3 3 3 3 3 3 3

我们使用tapply得到每个组的sum

tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)
#  0   1   2   3 
#  6   9  40 155

0-group 应该被排除，因为它们小于threshold 的所有值（因此-1）。第 2 组也应该包含第 1 组的总和，第 3 组应该包含第 1 组和第 2 组的总和。所以我们reverse 序列并采用cumsum

cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1]))

#  3   2   1 
#155 195 204

为了以原始顺序获取它并与threshold匹配，我们再次reverse

rev(cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1])))
#  1   2   3 
#204 195 155

边缘案例：

如果所有值都低于阈值或所有值都高于阈值，我们可能需要进行额外检查并返回以下内容。

if (all(x < thresholds[1]))   rep(0, length(thresholds))
if (all(x > thresholds[length(thresholds)])) rep(sum(x), length(thresholds))

【讨论】：

我明白了，当所有 x 都小于阈值时，您的方法不起作用。
@cuttlefish44 是的，并且当所有 x 都大于阈值时。我更新了处理这些情况的答案。
当向量不是整数时，您的方法似乎无法正常工作。我使用了您方法的略微修改版本进行测试（请参阅我的帖子中的f2），我可能误解了您的方法。