Avoiding a Slow Loop in R - dplyr Solution? (Answers)

【Question title】: Avoiding a Slow Loop in R - dplyr Solution?
【Posted】: 2020-03-12 12:59:08
【Question】:

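I have a problem that I can solve with a slow and clunky loop in R. However, I am hoping there is a more elegant (and faster) solution...

The simplest explanation I can think of: each row of data describes an action on a switch. The rows are sorted by switch ID (switch 1, switch 2, etc.) and then by the chronological order of the actions. Each switch can be either on or off at any point in time. The action can be "on", "off" or "same". For each row, I want to know the state of the switch (on or off) before and after the action described by that row.

Every switch starts in the "off" position.

(The data I am working with actually relates to insurance policies, but this switch-based analogy works and is probably easier to understand.)

A reproducible example: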

df <- data.frame(switch_id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
                  counter = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4),
                  action = c("on", "off", "on", "off", "on", "same", "same", "same", "on", "same", "same", "same", "off", "off", "off", "on", "off", "same", "on"))

I can get to where I want with a not particularly elegant loop:

df$status_before <- NA
df$status_after <- NA

for(i in 1:nrow(df)) 
{

  if(df$counter[i] == 1)
  {
    df$status_before[i] <- FALSE # switch always starts in the "off" position
  }
  else
  {
    df$status_before[i] <- df$status_after[i-1]
  }

  if(df$action[i] == "on") {
    df$status_after[i] <- TRUE
  }
  else if(df$action[i] == "off")
  {
    df$status_after[i] <- FALSE  
  }
  else # "same"
  {
    df$status_after[i] <- df$status_before[i] # leave everything alone
  }

}

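...but apparently loops are best avoided in R because they run very slowly. Of course it does not matter for this small dataset, but the real data I am working with has about 1 million rows, so this could be a problem.

Is there a "vectorized" solution, perhaps using dplyr-type commands?

Thanks.

【Question comments】:

Tags: r loops dplyr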

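【Solution 1】:

Here is a data.table solution:

Edit: the operations need to be done by switch_id; as of data.table v1.12.4 there is a native way to fill missing values (nafill), which is used in this edit; added some comments.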

library(data.table)
df <- data.table(switch_id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
    counter = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7),
    action = c("on", "off", "on", "off", "on", "same", "same", "same", "on", "same", "same", "same", "off", "off", "off"))

# in "status_after", replace "same" by NA and set "off" and "on" to FALSE and TRUE
df[, status_after := as.logical(factor(action, labels=c(FALSE, TRUE, NA)))]

# fill in NA using last observation carried forward, by switch_id
df[, status_after := as.logical(nafill(+(status_after), type = "locf")), by = switch_id]

# status_before: shift status_after (default: lag one), by switch_id
df[, status_before := shift(status_after), by = switch_id]

# set first instance of status_before per switch_id to FALSE
df[, status_before := c(FALSE, status_before[-1]), by = switch_id]

# reorder columns
setcolorder(df, c(1:3, 5, 4))
df
#>     switch_id counter action status_before status_after
#>  1:         1       1     on         FALSE         TRUE
#>  2:         1       2    off          TRUE        FALSE
#>  3:         1       3     on         FALSE         TRUE
#>  4:         1       4    off          TRUE        FALSE
#>  5:         1       5     on         FALSE         TRUE
#>  6:         1       6   same          TRUE         TRUE
#>  7:         1       7   same          TRUE         TRUE
#>  8:         1       8   same          TRUE         TRUE
#>  9:         2       1     on         FALSE         TRUE
#> 10:         2       2   same          TRUE         TRUE
#> 11:         2       3   same          TRUE         TRUE
#> 12:         2       4   same          TRUE         TRUE
#> 13:         2       5    off          TRUE        FALSE
#> 14:         2       6    off         FALSE        FALSE
#> 15:         2       7    off         FALSE        FALSE

Created on 2020-03-12 by the reprex package (v0.3.0)
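To see what the fill step does in isolation, here is a minimal sketch on a made-up vector (the names `status`, `as_int` and `filled` are illustrative only). The unary `+` turns the logical vector into integers before `nafill()` is called, which is presumably why the answer wraps the column in `+()`. A leading NA has nothing to carry forward and stays NA, so a switch whose very first action is "same" would likely need extra handling.

```r
library(data.table)

# NA marks rows where the action was "same" (state unchanged)
status <- c(NA, TRUE, NA, NA, FALSE, NA)

as_int <- +status                          # logical -> integer: NA 1 NA NA 0 NA
filled <- nafill(as_int, type = "locf")    # last observation carried forward

as.logical(filled)
# NA TRUE TRUE TRUE FALSE FALSE
```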

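【Comments】:

Thanks. Just one thing: status_before in row 9 should be FALSE, because we have moved from switch_id = 1 to switch_id = 2. Is there an easy way to do this, i.e. so that status_before is always set to FALSE for the first record of each switch_id? I just used df$status_before[df$counter == 1]
Fixed in an edit to the answer. I missed that, sorry.
Thanks. That works. I am not very familiar with the data.table dialect, but perhaps I should take some time to learn it.
When you work with large data, you will probably appreciate the speed and memory efficiency of data.table. Here is an introduction to its features: cran.r-project.org/web/packages/data.table/vignettes/…. However, if you are more comfortable with the tidyverse way, you can go with that too (or have a look at dtplyr).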
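As a possible shortcut for the "first record of each switch_id" question, `shift()` accepts a `fill` argument, so the lag and the FALSE starting value can be set in one step. This is a sketch on a small made-up table (`dt` and its values are illustrative), not the code from the answer:

```r
library(data.table)

dt <- data.table(
  switch_id    = c(1, 1, 1, 2, 2),
  status_after = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Lag status_after within each switch; fill = FALSE supplies the value for the
# first row of every group, where there is no previous row to take it from.
dt[, status_before := shift(status_after, n = 1L, fill = FALSE, type = "lag"),
   by = switch_id]

dt$status_before
# FALSE TRUE FALSE FALSE TRUE
```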

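【Solution 2】:

As far as I understand from looking at your loop, you want status_before to be TRUE/FALSE depending on the action of the previous counter, and status_after to be TRUE/FALSE depending on the action of the current counter. Did I get that right? I am not quite sure what you want to do with the same action...

To look at values in previous rows you can use the lag() function from dplyr (use lead() instead if you want to "look ahead"). This code gives the same output as the loop:

Edited: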

library(dplyr)
library(tidyr) # for fill()

# change "same" to the last value of action within each switch
# (if you don't want to change the actual action column, create a new one)
df <- df %>%
  group_by(switch_id) %>%
  mutate(action = ifelse(action == "same", NA, action)) %>% # mark "same" as NA
  fill(action) # make sure action is a character string, not a factor!

# do the actual evaluation (note: the results are the strings "TRUE"/"FALSE")
df <- df %>%
  group_by(switch_id) %>%
  mutate(status_before = case_when(lag(action) == "on" ~ "TRUE",
                                   lag(action) == "off" ~ "FALSE"),
         status_after = case_when(action == "on" ~ "TRUE",
                                  action == "off" ~ "FALSE"),
         # no previous action on the first row of a switch: it starts "off"
         status_before = replace(status_before, is.na(status_before), "FALSE"))

It should be correct now!
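For reference, a small sketch of how lag() and lead() behave on a plain vector (the vector `x` is made up for illustration). The `default` argument sets the value used where there is no neighbouring element, which could be an alternative to patching the NA afterwards with replace():

```r
library(dplyr)

x <- c("on", "off", "on")

dplyr::lag(x)                   # NA    "on"  "off"
dplyr::lag(x, default = "off")  # "off" "on"  "off"
dplyr::lead(x)                  # "off" "on"  NA
```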

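【Comments】:

Thanks. I will take a look at the code shortly. To answer your question: if the action is "same", then I want status_after to be the same as status_before.
"same" can be thought of as "look at the switch but don't touch it"
Thanks. The first piece of code seems to work. But I am having trouble understanding the part of the code that says lag(action) == "same" ~ "TRUE". Why would status_before always be TRUE in that case?
Perhaps my example dataset did not quite cover all the possibilities, because "same" only occurs where status_before is TRUE. But that will not always be the case. I have edited my original post to include a third switch_id, where "same" occurs while status_before is FALSE. In that case your code now gives me different output (status_before, row 19).
Oh, I see. I have changed my code. It should be fine now, but it is not as good as @user12728748's data.table solution (and, I assume, also slower).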
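One way to guard against the kind of mismatch found on row 19 is to compare any vectorized version against the original loop on the full three-switch example. The sketch below does this in base R only; `loop_status`, `vec_status` and `by_switch` are hypothetical helper names introduced here, and the vectorized rule (carry the last "on"/"off" forward within each switch, starting from FALSE) is one possible formulation rather than the code from either answer.

```r
df <- data.frame(
  switch_id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  counter   = c(1:8, 1:7, 1:4),
  action    = c("on", "off", "on", "off", "on", "same", "same", "same",
                "on", "same", "same", "same", "off", "off", "off",
                "on", "off", "same", "on"),
  stringsAsFactors = FALSE
)

# Reference implementation: the loop from the question, wrapped in a function.
loop_status <- function(d) {
  d$status_before <- NA
  d$status_after  <- NA
  for (i in seq_len(nrow(d))) {
    d$status_before[i] <- if (d$counter[i] == 1) FALSE else d$status_after[i - 1]
    d$status_after[i]  <- switch(d$action[i],
                                 on = TRUE, off = FALSE,
                                 d$status_before[i])  # "same": leave unchanged
  }
  d
}

# Apply f to x separately for each switch and put the pieces back in row order.
by_switch <- function(x, id, f) unsplit(lapply(split(x, id), f), id)

# Vectorized sketch: no row-by-row loop.
vec_status <- function(d) {
  d$status_after <- by_switch(d$action, d$switch_id, function(a) {
    changed <- a != "same"
    # position of the most recent "on"/"off" (0 = none yet, i.e. still off)
    c(FALSE, a[changed] == "on")[cumsum(changed) + 1L]
  })
  d$status_before <- by_switch(d$status_after, d$switch_id,
                               function(s) c(FALSE, head(s, -1)))
  d
}

ref <- loop_status(df)
out <- vec_status(df)
identical(ref$status_before, out$status_before)  # TRUE
identical(ref$status_after,  out$status_after)   # TRUE
```

The same comparison can be pointed at the data.table or dplyr results by swapping in their status columns, after converting strings to logicals where needed.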