Speeding up R function based on which() and rbinom() used in a data.table simulation (answers)

【Question title】: Speeding up R function based on which() and rbinom() used in a data.table simulation
【Posted】: 2026-02-01 07:50:01
【Question description】:

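I need help speeding up a simple function that uses which() and rbinom() to work out how long a nest survives, given a daily survival probability and the length of the incubation period. I use it in a data.table simulation inside a Shiny app, and this one line really, really slows things down.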

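The function in question is below. It calculates how long a nest will survive given a daily survival probability and an incubation period. It generates a 1 or a 0 for each day, where 1 means the nest survives that day and 0 means it fails. If the nest never fails, the function returns the full incubation period; if it does fail, it returns the day of failure, that is, the position of the first 0.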

library(magrittr)  # provides the %>% pipe used below

# specify parameters for function
period<-28
prob.surv<-0.98

# survival function that returns how long a nest survives for in days

survival<-function(period,prob.surv){
  which(rbinom(period,1,prob.surv)==0)[1] %>% replace(is.na(.), period)}

I then use it inside a longer function with data.table. Here is a simplified example:

library(data.table)
# make a dt
dat <- data.table(nests = 1:4000)

# date incubation starts
dat[,inc.start:= round(rnorm(n=nrow(dat), 80, sd = 2))]

# date incubation ends
dat[,inc.end:= inc.start + (replicate(n=nrow(dat), survival(28, 0.98)))]

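I am not sure whether using replicate() like this is a good idea, but I could not find a better solution.

Because this function is called 3 or 4 times in total during the simulation, it is a very big bottleneck in the code.

Any suggestions on how to speed up the survival() function, or how to use it more efficiently within data.table, would be much appreciated!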

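【Comments】:

The time to the first binomial success has a geometric distribution. Generating a single geometric variate, rather than period binomial variates that you then search, should improve performance.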
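
To spell out the comment above, here is a short sketch of why one geometric draw can stand in for the day-by-day draws. The symbols p, n, T and G are notation assumed here for illustration, not taken from the question; note that the "success" being waited for is the nest failing, which happens with probability 1 - p each day.

```latex
% Assumed notation: p = prob.surv, n = period,
% T = value returned by survival() (first failure day, or n if no failure),
% G = value returned by rgeom(1, 1 - p), i.e. surviving days before the first failure
\begin{align*}
P(T = k) &= p^{\,k-1}(1 - p), \qquad k = 1, \dots, n - 1, \\
P(T = n) &= p^{\,n-1}, \\
P(G = g) &= p^{\,g}(1 - p), \qquad g = 0, 1, 2, \dots, \\
T &= \min(G + 1,\; n).
\end{align*}
```

Because rgeom() starts counting at 0, the failure day is G + 1, not G. With p = 0.98 and n = 28, about 58% of nests end up with the capped value of 28, since 0.98^27 is roughly 0.58.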

Tags: r performance data.table

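【Solution 1】:

By far the fastest approach is to use the geometric distribution, as @Limey suggested in the comments (thanks!). Here is one slightly faster solution, and a much faster one that uses rgeom: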

library(microbenchmark)
library(magrittr)
library(data.table)

# specify parameters for function
period<-28
prob.surv<-0.98

# survival function that returns how long a nest survives for in days
survival_old <- function(period,prob.surv){
  which(rbinom(period,1,prob.surv)==0)[1] %>% 
    replace(is.na(.), period)
}
survival_new <- function(period,prob.surv){
  out <- as.logical(rbinom(period, 1, prob.surv))
  # a failure is a FALSE (0) draw, so look for the first FALSE
  ifelse(all(out), period, match(FALSE, out))
}

# make a dt
dat <- data.table(nests = 1:4000)
dat[,inc.start:= round(rnorm(n=nrow(dat), 80, sd = 2))]

Wrap the three alternatives in functions for benchmarking:

old <- function() {
  dat[,inc.end:= inc.start + (replicate(n=nrow(dat), survival_old(28, 0.98)))]
}
new <- function() {
  dat[, inc.end := sapply(inc.start, function(x) 
                          x + survival_new(28, 0.98))]
}
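new2 <- function() {
  # rgeom() counts the surviving days before the first failure,
  # so the failure day (what survival_old() returns) is that count + 1
  dat[, inc.end := rgeom(.N, 1 - .98) + 1][
      , inc.end := fifelse(inc.end > 28, 28, inc.end)][
      , inc.end := inc.start + inc.end]
}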

Run the benchmark:

microbenchmark(old(), new(), new2())
#> Unit: milliseconds
#>    expr        min        lq       mean     median         uq         max neval
#>   old() 292.031991 359.66243 420.835407 388.794828 458.942608 1055.786569   100
#>   new()  26.675279  32.80020  37.404787  35.519712  39.365767   93.748481   100
#>  new2()   1.285475   1.68351   2.072952   1.808423   2.088271    6.959055   100
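
Speed aside, it is worth checking that a geometric shortcut draws from the same distribution as the day-by-day version. The following is a minimal sketch of such a check, not part of the benchmark above: first_failure_binom is a hypothetical name for a magrittr-free restatement of the question's function, and the sample size and seed are arbitrary choices. It compares the simulated frequencies of both approaches with the probabilities implied by the question's description.

```r
set.seed(42)
period <- 28L
prob.surv <- 0.98
n <- 20000L

# Hypothetical restatement of the question's survival(): position of the
# first 0 among `period` daily draws, or `period` if there is no 0
first_failure_binom <- function(period, prob.surv) {
  first_zero <- which(rbinom(period, 1L, prob.surv) == 0L)[1]
  if (is.na(first_zero)) period else first_zero
}

binom_days <- replicate(n, first_failure_binom(period, prob.surv))

# Geometric shortcut: surviving days before the first failure, plus 1, capped
geom_days <- pmin(rgeom(n, 1 - prob.surv) + 1L, period)

# Probabilities of each outcome 1..period implied by the description
pmf <- c(prob.surv^(0:(period - 2L)) * (1 - prob.surv),
         prob.surv^(period - 1L))

# Largest absolute gap between simulated frequencies and those probabilities
binom_freq <- tabulate(binom_days, nbins = period) / n
geom_freq <- tabulate(geom_days, nbins = period) / n
max_gap <- c(binom = max(abs(binom_freq - pmf)),
             geom = max(abs(geom_freq - pmf)))
print(max_gap)
```

Both gaps should be small and of similar size. Dropping the + 1 shifts every uncapped value down by one day, which shows up as a nonzero count of 0-day survivals.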

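【Discussion】:

Brilliant, thank you both. I knew I must be missing something obvious.
Glad it helped! Feel free to accept whichever answer you end up using.

【Solution 2】: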

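For fun, here is an approach that keeps the original rbinom and uses Rcpp to loop over the results. The idea is that every rbinom call has overhead, so generating all the draws in a single call buys some performance. Rcpp is then used to take advantage of a short-circuiting loop that stops at the first failure.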

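Rcpp::cppFunction(code = 
                    "
// x holds grps consecutive blocks of period 0/1 draws;
// returns the first failure day of each block, or period if there is none
IntegerVector cppWhich(const IntegerVector x, const int grps, const int period,const double prob) {
    IntegerVector out(grps);
    
    for (int i = 0; i < grps; i++) {
    const int start = i * period;
    bool criteria_met = false;
      // scan the days belonging to nest i
      for (int j = start; j < start + period; j++) {
        // x(j) is 0 or 1, so for 0 < prob <= 1 this holds exactly when x(j) == 0
        if (x(j) < prob) {
          out(i) = j + 1 - start;
          criteria_met = true;
          break;  // short-circuit at the first failure
        }
      }
      if (!criteria_met) out(i) = period;
    }
    
    return(out);
}
    ")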

dat[, inc.end := {
  rbinoms = rbinom(28L * .N, 1L, 0.98)
  inc.start + cppWhich(rbinoms, .N, 28L, 0.98)
}]

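For all that work, it is still slower than @Vincent's rgeom approach. On my computer the timings are: new2(), 1 ms; complicated_Rcpp, 5 ms; new(), 22 ms. It is a reminder that I should study more statistics, because rgeom is great.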
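
For readers without a C++ toolchain, the draw-everything-in-one-call idea can also be sketched in base R. This is an illustrative alternative rather than the code timed above: first_failure_batch is a hypothetical helper name, and it has not been benchmarked against the other versions. It lays the draws out as a matrix with one column per nest, the same block layout that cppWhich() indexes.

```r
set.seed(1)

# Hypothetical helper: first failure day for n nests, drawn in a single batch
first_failure_batch <- function(n, period, prob.surv) {
  # One rbinom() call; each column holds the daily 0/1 outcomes of one nest
  draws <- matrix(rbinom(period * n, 1L, prob.surv), nrow = period)
  # Row/column positions of every failure, listed column by column
  hits <- which(draws == 0L, arr.ind = TRUE)
  # Rows are increasing within a column, so the first entry per column
  # is that nest's earliest failure
  first <- !duplicated(hits[, "col"])
  out <- rep(as.integer(period), n)  # default: survived the full period
  out[hits[first, "col"]] <- hits[first, "row"]
  out
}

days <- first_failure_batch(4000L, 28L, 0.98)
range(days)
```

Unlike the Rcpp loop, this version cannot stop early: it inspects every draw, so its appeal is simplicity, not speed.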

【Discussion】: