R - vectorised conditional replace

【Question title】: R - vectorised conditional replace
【Posted】: 2012-07-07 19:49:21
【Question description】:

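Hello, I am trying to manipulate a list of numbers, and I would like to do it without a for loop, using fast native operations in R. The pseudocode for the manipulation is: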

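By default the starting total is 100 (for every block of numbers between zeros).

From the first zero to the next zero, the moment the cumulative total falls by more than 2% (measured against the block's running maximum, as in the loop below), replace all subsequent numbers with zero.

Do this for all blocks of numbers between zeros.

The cumulative sum resets to 100 each time.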

For example, if the following is my data:

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);

The result would be:

0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
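To make the rule concrete, the first non-zero block of d can be traced by hand with cumsum() and cummax(). This is only an illustrative sketch: the names block, total, peak and first are not from the original post, and measuring the drop against the running peak is an assumption read off the loop further down.

```r
# First non-zero block of d (positions 4 to 12).
block <- c(1, 3, 4, 5, -1, 2, 3, -5, 8)

total <- 100 + cumsum(block)       # running total, starting from 100
peak  <- pmax(100, cummax(total))  # highest total seen so far, never below 100

# First position where the total sits more than 2% under the peak.
first <- which(0.98 * peak > total)[1]

# Everything after that position is replaced with zero.
out <- block
if (!is.na(first) && first < length(out)) {
  out[(first + 1):length(out)] <- 0
}
print(out)
```

The drop is detected at the -5 (total 112 against a peak of 117), so only the trailing 8 is zeroed, which matches the expected result above.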

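Currently I have an implementation with a for loop, but since my vector is really long, the performance is terrible.

Thanks in advance.

Here is running sample code: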

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
ans <- d;
running_total <- 100;
count <- 1;
max <- 100;
toggle <- FALSE;
processing <- FALSE;

for(i in d){
  if( i != 0 ){  
       processing <- TRUE; 
       if(toggle == TRUE){
          ans[count] = 0;  
       }
       else{
         running_total = running_total + i;
  
          if( running_total > max ){ max = running_total;}
          else if ( 0.98*max > running_total){
              toggle <- TRUE;  
          }
      }
   }

   if( i == 0 && processing == TRUE )
   { 
       running_total = 100; 
       max = 100;
       toggle <- FALSE;
   }
   count <- count + 1;
}
cat(ans)

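【Comments】:

Show us your for loop and what you have tried so far.
Thank you, I have updated the post with code. I appreciate your suggestions.
The Reduce function is useful for sequential processing of a vector, but I cannot tell what you are trying to do. The max=100 assignment is far higher than any number in the input vector, and the "processing" variable is never initialized, so as far as I can tell "toggle" stays TRUE forever after the first non-zero value is encountered. Some background on the problem might help.
Hi DWin, please accept my sincerest apologies. While prettifying the code for Stack Overflow formatting I accidentally deleted a few lines. I have updated it with a working version, and toggle and processing now behave as intended. You should be able to copy, paste and run it.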

Tags: r for-loop logic vectorization conditional-statements

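【Solution 1】:

I am not sure how to translate your loop into vectorized operations. However, there are two fairly easy options for large performance improvements. The first is to simply put your loop into an R function and precompile it with the compiler package. The second, slightly more complicated option is to translate your R loop into a c++ loop and use the Rcpp package to link it to an R function. You then call an R function that passes the data to the c++ code, which is fast. I show both options along with timings. I gratefully acknowledge the help of Alexandre Bujard from the Rcpp listserv, who helped me with a pointer issue I did not understand.

First, here is your R loop as a function, foo.r.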

## Your R loop as a function
foo.r <- function(d) {
  ans <- d
  running_total <- 100
  count <- 1
  max <- 100
  toggle <- FALSE
  processing <- FALSE

  for(i in d){
    if(i != 0 ){
      processing <- TRUE
      if(toggle == TRUE){
        ans[count] <- 0
      } else {
        running_total = running_total + i;
        if (running_total > max) {
          max <- running_total
        } else if (0.98*max > running_total) {
          toggle <- TRUE
        }
      }
    }
    if(i == 0 && processing == TRUE) {
      running_total <- 100
      max <- 100
      toggle <- FALSE
    }
    count <- count + 1
  }
  return(ans)
}

Now we can load the compiler package, compile the function, and call the result foo.rcomp.

## load compiler package and compile your R loop
require(compiler)
foo.rcomp <- cmpfun(foo.r)
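A note for current R versions: since R 3.4.0 the byte-code JIT is enabled by default, so closures are compiled automatically after they have been called a couple of times, and the explicit cmpfun() step tends to matter less than it did in 2012. The sketch below shows one way to see the interpreted versus compiled difference today by switching the JIT off temporarily. The toy function loop_sum is hypothetical and stands in for foo.r.

```r
library(compiler)

# Hypothetical toy loop, used only to compare interpreted and compiled runs.
loop_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

old_level <- enableJIT(0)         # switch the JIT off; returns the previous level
loop_sum_cmp <- cmpfun(loop_sum)  # explicit byte-compilation still works

t_plain <- system.time(r1 <- loop_sum(1e6))
t_cmp   <- system.time(r2 <- loop_sum_cmp(1e6))

enableJIT(old_level)              # restore the previous JIT setting

# Both versions must agree; only the run time should differ.
stopifnot(identical(r1, r2))
print(rbind(interpreted = t_plain, compiled = t_cmp)[, 1:3])
```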

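That is all it takes for the compilation route. It is all R and obviously very easy. Now for the c++ approach, we use the Rcpp package together with the inline package, which lets us "inline" the c++ code. That is, we do not have to create a source file and compile it ourselves; we simply include the code in the R script and the compilation is handled for us.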

## load Rcpp package and inline for ease of linking
require(Rcpp)
require(inline)

## Rcpp version
src <- '
  const NumericVector xx(x);
  int n = xx.size();
  NumericVector res = clone(xx);
  int toggle = 0;
  int processing = 0;
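  double tot = 100;  // double rather than int, so fractional inputs are not truncated
  double max = 100;  // running peak of tot within the current block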

  typedef NumericVector::iterator vec_iterator;
  vec_iterator ixx = xx.begin();
  vec_iterator ires = res.begin();
  for (int i = 0; i < n; i++) {
    if (ixx[i] != 0) {
      processing = 1;
      if (toggle == 1) {
        ires[i] = 0;
      } else {
        tot += ixx[i];
        if (tot > max) {
          max = tot;
        } else if (.98 * max > tot) {
            toggle = 1;
          }
      }
    }

   if (ixx[i] == 0 && processing == 1) {
     tot = 100;
     max = 100;
     toggle = 0;
   }
  }
  return res;
'

foo.rcpp <- cxxfunction(signature(x = "numeric"), src, plugin = "Rcpp")
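On newer Rcpp releases the inline package is optional, because Rcpp::cppFunction() compiles a complete c++ function from a string and exports it under the same name. The following is a sketch of that route, not the code from the answer: the name stop_after_drop is hypothetical, the running totals are kept as double, and the processing flag is dropped on the assumption that resetting at a zero before any non-zero value has been seen leaves the state unchanged. A working c++ compiler is still required.

```r
library(Rcpp)

# Hypothetical name; same per-block rule as the R loop, written for cppFunction().
cppFunction('
NumericVector stop_after_drop(NumericVector x) {
  int n = x.size();
  NumericVector res = clone(x);   // copy, so the input vector is not modified
  double tot = 100.0;             // running total inside the current block
  double peak = 100.0;            // highest total seen inside the current block
  bool stopped = false;           // true once the total has dropped more than 2%
  for (int i = 0; i < n; i++) {
    if (x[i] == 0) {              // a zero ends the block and resets the state
      tot = 100.0; peak = 100.0; stopped = false;
      continue;
    }
    if (stopped) { res[i] = 0; continue; }
    tot += x[i];
    if (tot > peak) peak = tot;
    else if (0.98 * peak > tot) stopped = true;
  }
  return res;
}')

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
cat(stop_after_drop(d))
```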

Now we can test that we get the expected results:

## demonstrate equivalence
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all.equal(foo.r(d), foo.rcpp(d))

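Finally, create a much larger version of d by repeating it 10^5 times. Then we can run the three different functions: the pure R code, the compiled R code, and the R function linked to c++ code.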

## make larger vector to test performance
dbig <- rep(d, 10^5)

system.time(res.r <- foo.r(dbig))
system.time(res.rcomp <- foo.rcomp(dbig))
system.time(res.rcpp <- foo.rcpp(dbig))

On my system, that gives:

> system.time(res.r <- foo.r(dbig))
   user  system elapsed 
  12.55    0.02   12.61 
> system.time(res.rcomp <- foo.rcomp(dbig))
   user  system elapsed 
   2.17    0.01    2.19 
> system.time(res.rcpp <- foo.rcpp(dbig))
   user  system elapsed 
   0.01    0.00    0.02
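A single system.time() call can be noisy, especially for a timing as small as 0.02 seconds. One hedged way to get a steadier figure is to repeat each call and take the median elapsed time. The helper time_median below is hypothetical, and cumsum is used only as a stand-in workload so the sketch runs on its own.

```r
# Hypothetical helper: median elapsed time of f(x) over several runs.
time_median <- function(f, x, reps = 5) {
  elapsed <- replicate(reps, system.time(f(x))[["elapsed"]])
  median(elapsed)
}

# Stand-in workload; swap in foo.r, foo.rcomp or foo.rcpp together with dbig.
x <- rep(c(0, 1, -3, 2), 1e4)
t_med <- time_median(cumsum, x)
print(t_med)
```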

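On the 2.5-million-element vector, the compiled R code takes about 1/6 of the time of the uncompiled R code. The c++ code is roughly two orders of magnitude faster again than the compiled R code, finishing in just 0.02 seconds. Apart from the initial setup, the syntax of the basic loop is nearly identical in R and c++, so you do not even lose clarity. I suspect that even if parts or all of your loop could be vectorized in R, you would be hard pressed to beat the performance of the R function linked to c++. Lastly, just for proof: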

> all.equal(res.r, res.rcomp)
[1] TRUE
> all.equal(res.r, res.rcpp)
[1] TRUE

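The different functions return the same results.

【Discussion】:

Well, that kind of pokes a hole in the whole "don't write R like C++ and expect it to be efficient" argument... this is good, I learned something here. Thanks.
Ditto. Thanks for taking the time to explain how to use C++.
You're welcome. @Chase I think "don't program R like C++" still holds, at least in the general case. R's strength is still ease of prototyping. If there were a good way to vectorize this problem, I bet the total number of lines of code required would be halved. For example, you can get the row sums of a matrix by looping over each row and summing, or just with rowSums(). I should also point out that for the Rcpp solution to work you need a c++ compiler. It is probably built in on *nix, but on Windows you can get Rtools.
Yes, that is why I think a combination of vectorization and compilation will ultimately be the ideal solution. I am still trying to convert some elements of this loop using the Reduce function that DWin suggested earlier. I will let you know if I have any luck. Thanks again, everyone, for looking into this.
I think the difficulty with vectorizing is that everything seems to depend on the preceding steps. That is not to say that, with some thought, you could not redo your algorithm another way, but from a coding standpoint nothing obvious suggests itself for vectorization. Rather than performing some operation on each element of the vector (which is relatively easy to vectorize), you perform different operations depending on earlier results, which requires those earlier results to be available.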
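For reference, the Reduce idea can be sketched by carrying the loop state (running total, peak, stopped flag) from one element to the next with accumulate = TRUE. This is an assumption about how such a conversion might look, not code posted in the thread; the names step, init, states and stopped_before are hypothetical. Reduce still iterates at the R level, so it is not expected to be faster than the plain loop.

```r
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)

init <- c(total = 100, peak = 100, stopped = 0)

# One step of the loop: take the previous state and the next value, return the new state.
step <- function(s, x) {
  if (x == 0) return(init)                 # a zero resets the block
  if (s[["stopped"]] == 1) return(s)       # block already stopped: state is frozen
  total <- s[["total"]] + x
  peak  <- max(s[["peak"]], total)
  c(total = total, peak = peak, stopped = as.numeric(0.98 * peak > total))
}

# states[[i]] is the state *before* element i is processed (length(d) + 1 entries).
states <- Reduce(step, d, init, accumulate = TRUE)
stopped_before <- vapply(states[-length(states)],
                         function(s) s[["stopped"]], numeric(1))

ans <- d
ans[stopped_before == 1 & d != 0] <- 0
cat(ans)
```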
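One possible way around that dependence, offered as a sketch rather than something proposed in the thread: within a block, the running total and its peak are plain cumulative sums and maxima, and the only thing that matters afterwards is where the first 2% drop occurs. The function name stop_after_drop_vec and the parameters start and keep are hypothetical. ave() splits the data per block internally, so with very many short blocks this may well be slower than the compiled loop, and for non-integer data cumsum() may round slightly differently from step-by-step addition.

```r
# Hypothetical vectorised reformulation using only base R.
stop_after_drop_vec <- function(d, start = 100, keep = 0.98) {
  nz  <- d != 0
  grp <- cumsum(!nz)  # each zero opens a new group; the non-zero run after it shares its id

  total <- start + ave(d, grp, FUN = cumsum)           # running total per block
  peak  <- pmax(start, ave(total, grp, FUN = cummax))  # running peak per block

  hit <- nz & (keep * peak > total)                    # total is more than 2% under the peak

  # Number of hits strictly before each position within its block.
  prior_hits <- ave(as.numeric(hit), grp, FUN = cumsum) - hit

  d[nz & prior_hits > 0] <- 0                          # zero everything after the first hit
  d
}

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
cat(stop_after_drop_vec(d))
# 0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
```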