R - Vectorize Nested For Loop

【Question title】: R - Vectorize Nested For Loop
【Posted】: 2017-09-04 01:50:37
【Question】:

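Apologies for yet another "vectorize this loop" question, but I can't figure out how to do it. The function I want to write is simple:

For each row in enroll.in, first use the output of the hasMedClaims logistic model as the probability of a response.

Generate a random number and use it to decide whether the response should be modeled.

If yes, model the response. If not, just enter a 0. Repeat this nsim times for each row of enroll.in.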

simMedClaims.loop<-function(hasMedClaims.in, MedClaims.in,  enroll.in, nsim = 100){

  set.seed(100)
  #dataframe to hold results
  results<-matrix(0, ncol = nsim, nrow = nrow(enroll.in))
  results<-data.frame(results)

  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  for(ii in 1:nrow(enroll.in))
  {
    for(jj in 1:nsim){
      unif.rand<-runif(1)
      results[ii,jj]<-ifelse(unif.rand < hasclaims[ii], exp(rnorm(1,mean = means[ii], sd = sqrt(MedClaims.in$sig2))), 0)
    }

  }

  return(results)
}

set.seed(100)
dummy<-data.frame(hasresponse = rbinom(100000, 1, .5), response = rnorm(100000, mean = 5, sd = 1), x1 = runif(100000, 0, 60), x2 = as.factor(rbinom(100000, 1, .5)+1))
dummy$response<-dummy$hasresponse*dummy$response
hasresponse_gam<-mgcv::gam(hasresponse ~ s(x1,bs="ps", by=x2)+x2, data=dummy, family = binomial(link="logit"), method="REML")
response<-mgcv::gam(response ~ s(x1,bs="ps", by=x2)+x2, data=dummy[dummy$hasresponse==1,])
dummyEnroll<-data.frame(x1 = runif(10, 20, 50), x2 = as.factor(rbinom(10, 1, .5)+1))
system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 1000))

user  system elapsed 
38.66    0.00   39.35

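I have tried many different ideas, but each one ran into a different problem.

hasMedClaims.in and MedClaims.in are both GAMs fitted with the gam function from mgcv.

To clarify why I'm asking: as the output shows, each subject takes a few seconds to run 1000 simulations. I will be using this on a dataset with tens of thousands of subjects, and I want to run at least 50,000 simulations. My current code works, but it is far too slow. My goal is to optimize the function so that it runs faster.

Trying @Parfait's func2: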

simMedClaims2<-function(hasMedClaims.in, MedClaims.in,  enroll.in, nsim = 100){
  set.seed(100)
  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  results<-data.frame(t(vapply(seq(nrow(enroll.in)), function(ii, jj){
    ifelse(runif(jj) < hasclaims[ii],1,0)*exp(rnorm(nsim,mean = means[ii], sd = sqrt(MedClaims.in$sig2)))
  },numeric(nsim),seq(nsim))))
  return(results)
}

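I haven't fully reviewed the results yet, but they look reasonable. I also edited my original loop function so that the means are computed outside the loop. It is much faster: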

> system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed 
   0.06    0.00    0.13
> system.time(result2<-simMedClaims2(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed 
   0.02    0.00    0.02

However, running all.equal(result, result2) shows that the outputs are not equal, and I don't know why.

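【Comments】:

Can you provide an MWE?
Unfortunately, I can't share any of the data I'm using. What should I add?
No, don't use your data; provide a working dummy example :)
I think what I've added now should be enough. I've also clarified exactly what I'm trying to do.

Tags: r for-loop vectorization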

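【Solution 1】:

Consider passing two vector arguments to sapply or vapply to avoid the nested for loops and the need to initialize the results data frame. Strictly speaking, whether the apply family counts as truly vectorized is debatable, since it still calls the R function once per element: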

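simMedClaims.loop <- function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){

  hasclaims <- predict(hasMedClaims.in, newdata = enroll.in, type = "response")

  # vapply calls the function once per row; jj receives seq(nsim), so runif(jj) draws nsim values
  results <- data.frame(t(vapply(seq(nrow(enroll.in)), function(ii,jj) { 
                                      unif.rand <- runif(jj) 
                                      # "..." is the modeled response, left elided as in the original answer
                                      ifelse(unif.rand < hasclaims[ii], ..., 0)
                                  },
                                  numeric(nsim), seq(nsim))))
  results
}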
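To make the vapply idea concrete without the fitted GAMs, here is a minimal runnable sketch. The vectors p, mu and the scalar sigma are hypothetical stand-ins for hasclaims, means and sqrt(MedClaims.in$sig2); they are not from the original post, and sim_vapply is an illustrative name.

```r
# Hypothetical stand-ins for the model outputs (assumptions, not from the post)
p     <- c(0.2, 0.5, 0.8)   # plays the role of hasclaims
mu    <- c(4.5, 5.0, 5.5)   # plays the role of means
sigma <- 1                  # plays the role of sqrt(MedClaims.in$sig2)
nsim  <- 1000

sim_vapply <- function(p, mu, sigma, nsim) {
  # One function call per row; each call draws all nsim simulations at once
  out <- vapply(seq_along(p), function(ii) {
    has_claim <- runif(nsim) < p[ii]
    # TRUE/FALSE acts as 1/0, so rows without a claim become 0
    has_claim * exp(rnorm(nsim, mean = mu[ii], sd = sigma))
  }, numeric(nsim))
  t(out)  # rows = subjects, columns = simulations
}

set.seed(100)
res <- sim_vapply(p, mu, sigma, nsim)
dim(res)
rowMeans(res > 0)  # share of non-zero draws per row, should sit near p
```

Only the loop over simulations is removed here; the loop over rows still runs inside vapply, which is usually acceptable when nsim is much larger than the per-row overhead.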

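Alternatively, consider an expand.grid() approach, converting to the required multi-column format at the end. This is untested because no data was available, but it would be vectorized (no R-level loops, though possibly loops in C).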

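simMedClaims.loop <- function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){

  hasclaims <- predict(hasMedClaims.in, newdata = enroll.in, type = "response")

  # LONG FORMAT: sim varies fastest, so each run of nsim rows belongs to one enroll.in row
  df <- expand.grid(sim = 1:nsim, row = 1:nrow(enroll.in))
  df$unif.rand <- runif(nrow(df))
  # "..." is the modeled response, left elided as in the original answer
  df$val <- ifelse(df$unif.rand < hasclaims[df$row], ..., 0)

  # WIDE FORMAT: one row per enroll.in row, one column per simulation
  results <- data.frame(t(sapply(seq(1, nrow(df), by=nsim), function(i) 
                                 df$val[i:(i+(nsim-1))])))
  results
}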
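The long-then-wide reshaping can also be skipped by drawing straight into a matrix. The sketch below is one way to do that under stated assumptions: p, mu and sigma are hypothetical stand-ins for hasclaims, means and sqrt(MedClaims.in$sig2), and sim_matrix is an illustrative name, not part of the original answer.

```r
# Hypothetical stand-ins for the model outputs (assumptions, not from the post)
p     <- c(0.2, 0.5, 0.8)
mu    <- c(4.5, 5.0, 5.5)
sigma <- 1
nsim  <- 1000

sim_matrix <- function(p, mu, sigma, nsim) {
  n <- length(p)
  # Matrices are filled column by column, so length-n vectors such as p and mu
  # recycle down each column and therefore line up with the rows
  u <- matrix(runif(n * nsim), nrow = n)
  z <- matrix(rnorm(n * nsim, mean = mu, sd = sigma), nrow = n)
  (u < p) * exp(z)  # n x nsim: 0 where no claim, exp(normal draw) otherwise
}

set.seed(100)
res <- sim_matrix(p, mu, sigma, nsim)
dim(res)
rowMeans(res > 0)  # should sit near p
```

This draws a normal value for every cell, including those that end up as 0, trading some wasted draws for having no R-level loop at all. For tens of thousands of rows and 50,000 simulations the full matrix may not fit in memory, so processing the rows in blocks would be a reasonable variation.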

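The approaches above were tested with random data and return the same results as the nested for loop (leaving out the OP's predict and ifelse calls, since there was no reproducible example):

Data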

enroll.in <- sapply(1:5, function(i) rnorm(15))
nsim <- 100

Methods

func1 <- function() {      
  set.seed(98)
  results1<-matrix(0, ncol = nsim, nrow = nrow(enroll.in))
  results1<-data.frame(results1)

  for(ii in 1:nrow(enroll.in))
  {
   for(jj in 1:nsim){

     results1[ii,jj] <- runif(1)
   }
  }
  return(results1)
}

func2 <- function() {
  set.seed(98)
  results2 <- data.frame(t(vapply(seq(nrow(enroll.in)), function(ii,jj) 
                                       runif(jj), 
                                  numeric(nsim), seq(nsim))))
}

func3 <- function() {
  set.seed(98)
  df <- expand.grid(1:nrow(enroll.in), 1:nsim)
  df$random_num <- runif(nrow(df))

  results3 <- data.frame(t(sapply(seq(1, nrow(df), by=nsim), function(i) 
                                  df$random_num[i:(i+(nsim-1))])))
}

Results

all.equal(func1(), func2())
# [1] TRUE
all.equal(func2(), func3())
# [1] TRUE

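Benchmarks suggest that, at least on small data, none of the approaches is noticeably faster than the others. Note that the expressions below are written without parentheses (func1 rather than func1()), so microbenchmark times only the lookup of the function object. That is why the figures are in nanoseconds, and they do not measure the loops themselves. The intended takeaway is the old adage that there is nothing inherently wrong with for loops: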

library(microbenchmark)

microbenchmark(func1)
# Unit: nanoseconds
#   expr min lq  mean median uq max neval
#  func1  30 32 37.07     32 33 461   100

microbenchmark(func2)
# Unit: nanoseconds
#   expr min lq  mean median uq max neval
#  func2  29 31 39.41     32 33 729   100

microbenchmark(func3)
# Unit: nanoseconds
#   expr min lq mean median uq max neval
#  func3  30 31 35.6     32 33 370   100
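For the timings to reflect the work the functions do, the expression handed to microbenchmark has to call the function. The sketch below shows that pattern on two small illustrative functions, loop_fill and vec_fill (hypothetical names, written here to be self-contained). It falls back to base system.time when the microbenchmark package is not installed. No timings are quoted, since they depend on the machine.

```r
# Nested loop, filling a preallocated matrix one cell at a time
loop_fill <- function(n, nsim) {
  res <- matrix(0, nrow = n, ncol = nsim)
  for (ii in seq_len(n)) {
    for (jj in seq_len(nsim)) res[ii, jj] <- runif(1)
  }
  res
}

# Single vectorized draw; byrow = TRUE matches the loop's row-by-row fill order
vec_fill <- function(n, nsim) matrix(runif(n * nsim), nrow = n, byrow = TRUE)

# Same seed, same draw order, so the two results can be compared directly
set.seed(98); a <- loop_fill(15, 100)
set.seed(98); b <- vec_fill(15, 100)
print(all.equal(a, b))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # Note the parentheses: the calls are timed, not the function objects
  print(microbenchmark::microbenchmark(loop_fill(15, 100), vec_fill(15, 100),
                                       times = 20))
} else {
  print(system.time(for (k in 1:20) loop_fill(15, 100)))
  print(system.time(for (k in 1:20) vec_fill(15, 100)))
}
```

The loop here writes into a matrix rather than a data frame. Assigning into a data frame one cell at a time, as the question's loop does, is typically far slower, so that is worth checking before blaming the loop itself.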

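【Comments】:

I've made some additions to improve my question.
Have you tried this solution with the sample data?
Then what problem do you still have?
The only remaining problem is that the loop function and the vectorized function don't return equivalent results. I'm not sure whether that is just a randomization issue or whether something else is wrong with the vectorized function.
You need to call set.seed() for each random draw. Your ifelse() contains an rnorm() call. To reproduce the same results in the loop and in vapply, add a set.seed() with the same number after unif.rand <- runif(...) but before ifelse(). Also, is the final structure the same (nrows and ncols)?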
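
A likely reason the two versions differ even with the same seed is the order in which random numbers are consumed. The loop interleaves uniform and normal draws and only calls rnorm() when a claim occurs, while the vectorized version draws all uniforms first and then all normals. The sketch below illustrates this with hypothetical scalar stand-ins p, mu and sigma (assumptions, not from the post).

```r
# Hypothetical stand-ins for one row's model outputs
p <- 0.5; mu <- 5; sigma <- 1; nsim <- 5

# Loop style: draws are interleaved, and rnorm() runs only when u < p
set.seed(100)
loop_res <- numeric(nsim)
for (jj in seq_len(nsim)) {
  u <- runif(1)
  loop_res[jj] <- if (u < p) exp(rnorm(1, mean = mu, sd = sigma)) else 0
}

# Vectorized style: all uniforms are drawn first, then all normals
set.seed(100)
u_all   <- runif(nsim)
vec_res <- (u_all < p) * exp(rnorm(nsim, mean = mu, sd = sigma))

print(loop_res)
print(vec_res)
print(isTRUE(all.equal(loop_res, vec_res)))
```

Both versions sample from the same distribution, so comparing summaries over many simulations (for example the share of zeros and the mean of the non-zero values per row) is a more meaningful check than all.equal() on the raw draws.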