Optimization of iteration in R: answers

【Question title】: Optimization of iteration in R
【Posted】: 2018-02-05 17:04:00
【Question description】:

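Preface: I have two csv tables, each with 3 million rows and about 20 columns, and I want to extract 5 columns for all rows that meet specific requirements. It would have been better to use SQL or some other database tool, but hey, I started with R! Now I have to finish it.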

Currently my request is running on an R server with about 16 GB RAM. Tomorrow the run for the first table will reach one week of runtime, with about 80% done.

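Which leads me to the following question: does it make any difference how I formulate my if clause? Currently I do the following (loading the csv, preparing the data frame, etc. omitted):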

i = 1
while(i < length_csv){
   if((csv$column11[i] != condition1) && (csv$column11[i] != condition2) 
   && (csv$column11[i] != condition3) && (csv$column11[i] != condition4) 
   && (csv$column11[i] != condition5) && (csv$column11[i] != condition6) 
   && (csv$column11[i] != condition7) && (csv$column3[i] == condition8)){
      dataframe = rbind(dataframe,c(csv$column1[i],csv$column2[i],csv$column11[i],csv$column12[i],csv$column13[i]))
      }
   i = i + 1
}

Would it be more efficient if the request were nested like this:

i = i+1
while(i < length_csv){
    if(csv$column3[i] == condition8){
        if(csv$column11[i] != condition1){
            if(csv$column11[i] != condition2){
                ... etc 
                }
    }
}

Or is there another way to formulate the request that I may have overlooked?

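【Comments】:

Erik, this is a clear case where: (1) using vectorized operations instead of while or for loops can improve performance significantly; (2) repeated rbind like this works fine for low counts and scales horribly; (3) we would benefit from a slightly better example, including a small sample of data.
Please see stackoverflow.com/questions/5963269/… for some ways to provide relevant (but not huge) sample data.
Is it safe to assume that csv$row11 and csv$row3 are really columns in the data? While the code is clearly accessing columns, the "row" in the names is a bit ... off ...
@r2evans: Since I have only worked with small amounts of data so far, I was not aware of the scaling problem, thanks. Also, yes, row = column; I will edit that later. And no, unfortunately I cannot provide a data sample, since I only got the data for my thesis and am not allowed to distribute it. Otherwise I would.
It saddens me that you have been waiting a week for something that should take seconds (or at most minutes, depending on your data and conditions).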

Tags: r if-statement optimization

【Solution 1】:

If possible, I suggest you avoid for loops and repeated rbind when filtering your data. Using some sample data:

set.seed(2)
n <- 1e4
df <- data.frame(
  row11 = sample(100, size=n, replace=TRUE),
  row3 = sample(100, size=n, replace=TRUE)
)
dim(df)
# [1] 10000     2
head(df)
#   row11 row3
# 1    19    5
# 2    71   27
# 3    58   31
# 4    17   52
# 5    95   37
# 6    95   79

Vectorize it!

cond1 <- df$row11 > 30
cond2 <- df$row11 < 40
cond3 <- df$row3 > 10
cond4 <- df$row3 < 15
str(cond1)
#  logi [1:10000] FALSE TRUE TRUE FALSE TRUE TRUE ...
out1 <- df[ cond1 & cond2 & cond3 & cond4, ]
str(out1)
# 'data.frame': 31 obs. of  2 variables:
#  $ row11: int  39 35 37 33 37 36 32 34 32 37 ...
#  $ row3 : int  13 11 14 13 11 13 14 12 11 12 ...

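(Using cond1 etc. as pre-defined logical vectors is completely optional; the literal conditions work just as well inside the [...] brackets. Also, I know your data has more columns; this works the same way with more columns.)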
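For the filter described in the question, the seven `!=` tests on one column can be collapsed into a single `%in%` test, and the five wanted columns can be selected in the same step. This is a minimal sketch, assuming made-up data and hypothetical values for condition1 ... condition8 (the question does not give the real ones); the names `excluded`, `keep`, `wanted` and `result` are illustrative only:

```r
# Hypothetical stand-in for the real CSV; column names follow the question.
set.seed(1)
n <- 1000
csv <- data.frame(
  column1  = seq_len(n),
  column2  = sample(letters, n, replace = TRUE),
  column3  = sample(5, n, replace = TRUE),
  column11 = sample(20, n, replace = TRUE),
  column12 = runif(n),
  column13 = runif(n)
)

# Assumed values; the question does not state the real conditions.
excluded   <- 1:7   # plays the role of condition1 ... condition7
condition8 <- 3

# One logical vector for all rows: column11 is none of the excluded values
# AND column3 equals condition8.
keep <- !(csv$column11 %in% excluded) & csv$column3 == condition8

# Subset the rows and pick the five wanted columns in a single step.
wanted <- c("column1", "column2", "column11", "column12", "column13")
result <- csv[keep, wanted]
str(result)
```

Note the single `&` here: it compares whole vectors element by element, whereas `&&` is meant for the single TRUE/FALSE inside an `if`.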

To see the benefit of the vectorized approach over loops (a literal for loop, or lapply used in a similar way):

library(microbenchmark)
microbenchmark(
  vec = {
    cond1 <- df$row11 > 30
    cond2 <- df$row11 < 40
    cond3 <- df$row3 > 10
    cond4 <- df$row3 < 15
    df[ cond1 & cond2 & cond3 & cond4, ]
  },
  forloop = {
    out2 <- df[0,]
    for (i in seq_len(nrow(df))) {
      if (df$row11[i] > 30 && df$row11[i] < 40 &&
            df$row3[i] > 10 && df$row3[i] < 15) {
        out2 <- rbind(out2, df[i,,drop=FALSE])
      }
    }
  },
  lapp = {
    out3 <- lapply(seq_len(nrow(df)), function(i) {
      if (df$row11[i] > 30 && df$row11[i] < 40 &&
            df$row3[i] > 10 && df$row3[i] < 15) {
        df[i,,drop=FALSE]
      }
    })
    do.call(rbind, out3)
  }
)
# Unit: microseconds
#     expr        min         lq        mean      median          uq        max neval
#      vec    340.605    381.813    444.9889    409.1635    476.2635    758.519   100
#  forloop 142056.061 154749.407 169612.1311 165602.7955 178100.6755 254283.720   100
#     lapp 148903.885 161126.073 178910.3185 172380.4195 186945.8120 256529.009   100

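In other words, what the vectorized version does in about 409 microseconds (median) takes the for and lapply implementations about 166,000 to 172,000 microseconds, roughly 400 times longer.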
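As for the original question of one chained `&&` condition versus nested `if` statements: `&&` evaluates left to right and stops at the first FALSE, so both forms end up doing the same comparisons. A small sketch, using a hypothetical helper `check()` that only exists here to count how often a test is evaluated:

```r
# Hypothetical helper: records each call, then returns its argument.
calls <- 0
check <- function(x) {
  calls <<- calls + 1
  x
}

# Chained form: the second and third checks are never evaluated,
# because the first one is already FALSE.
calls <- 0
if (check(FALSE) && check(TRUE) && check(TRUE)) message("kept")
chained_calls <- calls

# Nested form: the inner ifs are never reached for the same reason.
calls <- 0
if (check(FALSE)) {
  if (check(TRUE)) {
    if (check(TRUE)) message("kept")
  }
}
nested_calls <- calls

c(chained = chained_calls, nested = nested_calls)
```

Putting the most selective test first can skip some comparisons in either form, but the per-row loop overhead and the repeated rbind remain, so any gain is likely small next to vectorizing.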

Larger data

For a demonstration closer to the size of your data:

set.seed(2)
# 3 million rows
nr <- 3e6
# 20 columns
nc <- 20
df <- as.data.frame(setNames(lapply(seq_len(nc), function(i) sample(100, size=nr, replace=TRUE)),
                             paste0("row", seq_len(nc))))
str(df)
# 'data.frame': 3000000 obs. of  20 variables:
#  $ row1 : int  19 71 58 17 95 95 13 84 47 55 ...
#  $ row2 : int  55 86 45 12 20 4 53 53 9 56 ...
#  $ row3 : int  78 100 93 86 67 61 45 41 82 32 ...
#  $ row4 : int  2 8 71 33 10 61 84 6 12 72 ...
#  $ row5 : int  31 27 32 75 100 54 80 2 52 10 ...
#  $ row6 : int  35 84 37 100 61 27 8 89 18 69 ...
#  $ row7 : int  100 28 54 34 18 68 25 96 8 9 ...
#  $ row8 : int  47 4 50 4 46 34 64 88 17 73 ...
#  $ row9 : int  45 91 13 1 78 17 40 78 81 39 ...
#  $ row10: int  31 41 87 60 30 30 22 99 85 44 ...
#  $ row11: int  83 90 10 51 88 27 21 48 87 27 ...
#  $ row12: int  94 83 44 53 58 41 39 5 93 6 ...
#  $ row13: int  65 90 8 55 85 100 14 41 44 99 ...
#  $ row14: int  39 29 18 32 87 80 32 62 22 12 ...
#  $ row15: int  33 15 58 46 7 4 61 35 32 60 ...
#  $ row16: int  22 17 58 27 24 56 83 59 22 44 ...
#  $ row17: int  38 28 7 40 95 21 13 53 78 64 ...
#  $ row18: int  64 12 88 55 36 68 84 16 82 15 ...
#  $ row19: int  48 53 75 62 61 31 36 23 4 18 ...
#  $ row20: int  25 89 1 11 10 40 24 50 50 66 ...

system.time({
  cond1 <- df$row11 > 30
  cond2 <- df$row11 < 40
  cond3 <- df$row3 > 10
  cond4 <- df$row3 < 15
  out1 <- df[ cond1 & cond2 & cond3 & cond4, ]
})
#    user  system elapsed 
#    0.14    0.04    0.18

That reduces 3M rows to just over 10K in well under one second:

str(out1)
# 'data.frame': 10685 obs. of  20 variables:
#  $ row1 : int  47 82 31 1 10 86 97 85 74 56 ...
#  $ row2 : int  42 5 48 1 48 10 11 18 11 94 ...
#  $ row3 : int  13 12 11 12 13 12 12 11 14 11 ...
#  $ row4 : int  75 29 66 53 21 2 78 52 39 87 ...
#  $ row5 : int  69 90 27 67 96 23 1 36 70 83 ...
#  $ row6 : int  95 77 34 99 26 63 78 100 23 42 ...
#  $ row7 : int  23 27 95 61 58 91 36 35 35 35 ...
#  $ row8 : int  57 92 47 23 69 49 1 44 29 99 ...
#  $ row9 : int  49 17 44 65 10 94 76 60 74 81 ...
#  $ row10: int  85 86 77 76 54 29 12 14 87 68 ...
#  $ row11: int  34 31 34 34 37 31 32 37 31 37 ...
#  $ row12: int  15 69 35 53 92 67 47 73 66 55 ...
#  $ row13: int  66 57 78 8 2 14 31 88 46 67 ...
#  $ row14: int  41 83 28 47 98 61 79 93 35 79 ...
#  $ row15: int  36 37 15 12 18 62 25 64 15 98 ...
#  $ row16: int  72 60 93 31 27 84 37 78 34 76 ...
#  $ row17: int  83 2 48 20 92 25 6 57 55 66 ...
#  $ row18: int  45 88 86 71 92 27 20 82 89 43 ...
#  $ row19: int  9 34 79 9 28 39 37 72 90 14 ...
#  $ row20: int  59 3 44 35 65 54 41 50 87 18 ...
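One caveat when applying this to a real CSV: if the tested columns contain missing values, the comparisons return NA, and logical indexing with NA yields a row filled with NA instead of dropping it. A small sketch with made-up data (`df_na` is hypothetical); wrapping the condition in `which()` is one common way to drop such rows:

```r
# Made-up data with a missing value in one of the tested columns.
df_na <- data.frame(row11 = c(35, NA, 50), row3 = c(12, 12, 12))

cond <- df_na$row11 > 30 & df_na$row11 < 40 &
  df_na$row3 > 10 & df_na$row3 < 15
cond

# Logical indexing keeps an all-NA row where the condition is NA.
with_na <- df_na[cond, ]
nrow(with_na)

# which() returns only the positions that are TRUE, so the NA row is dropped.
without_na <- df_na[which(cond), ]
nrow(without_na)
```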

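【Discussion】:

(By the way, I am puzzled why the lapply implementation performs worse than the for loop, especially given the inefficiency of rbind repeated millions of times. It seems to me it should be slightly faster.)
Sorry for the late reply, I had a lot of work to do. I just spent some time understanding your code and will apply it to my data tomorrow, but so far it looks both great and short. Thank you very much for sharing your knowledge.
Well, I am going to go cry in a ditch somewhere. Your vectorized approach finished in a shocking 4 seconds. Thank you so much; I will make sure to credit you, or at least mention you, in my thesis.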

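【Solution 2】:

You can improve your code by not calling "rbind" at every step of the iteration. You should avoid growing your objects (preallocate if possible) and use vectorized operations. You could try something like this (untested, since no representative sample of the data was provided):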

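# seq_len() is safe even if length_csv is 0 (1:0 would iterate over c(1, 0))
tst <- lapply(seq_len(length_csv), function(i) {
  if((csv$row11[i] != condition1) && (csv$row11[i] != condition2) 
     && (csv$row11[i] != condition3) && (csv$row11[i] != condition4) 
     && (csv$row11[i] != condition5) && (csv$row11[i] != condition6) 
     && (csv$row11[i] != condition7) && (csv$row3[i] == condition8)) {
     # keep the value for matching rows; other rows return NULL implicitly
     out <- csv$row11[i]
     return(out)
  }

})

# bind everything once at the end; rbind ignores the NULL entries
dataframe <- data.frame(do.call(rbind, tst))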
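To illustrate the advice about not growing objects: if a loop cannot be avoided, one option is to preallocate a logical vector, fill it inside the loop, and subset once at the end. This is only a sketch with made-up data and hypothetical condition values (`excluded`, `condition8`), not the asker's real ones:

```r
# Made-up stand-in for the asker's data; column names as in the snippet above.
set.seed(42)
csv <- data.frame(row3  = sample(5, 1000, replace = TRUE),
                  row11 = sample(20, 1000, replace = TRUE))
excluded   <- 1:7   # hypothetical condition1 ... condition7
condition8 <- 3     # hypothetical

# Preallocate one TRUE/FALSE slot per row instead of growing a data frame.
keep <- logical(nrow(csv))
for (i in seq_len(nrow(csv))) {
  keep[i] <- !(csv$row11[i] %in% excluded) && csv$row3[i] == condition8
}

# Subset once instead of calling rbind in every iteration.
looped <- csv[keep, ]

# The fully vectorized form selects the same rows without the loop.
vectorized <- csv[!(csv$row11 %in% excluded) & csv$row3 == condition8, ]
identical(looped, vectorized)
```

The loop version is shown only for comparison; the vectorized line does the same work in one pass.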

【Discussion】: