for循环一个小标题需要太多时间答案

【问题标题】：for loop a tibble takes too much timefor循环一个小标题需要太多时间
【发布时间】：2021-11-06 15:18:19
【问题描述】：

对于我正在进行的项目，我正在分析两个数据集，每个数据集有 500,000 行。我必须根据一个特定列中的值过滤这些行。这是我编写的用于小标题的函数：


theme_analyser <- function(tibble_to_analyse) {
 
for (i in 1:nrow(tibble_to_analyse)) {
  theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";"))
  if (any(theme %in% themes_to_use)){
    next}
  else {
    tibble_to_analyse <- tibble_to_analyse[-i,]
  }
}  
}

在此函数中，themes_to_use 是一个包含一组字符串值的向量。主题列采用多个值，每个值之间用“;”分隔。因此，我首先将这些值拆分并取消列出。

此代码的问题在于它运行速度太慢。它设法在 18 小时内完成了仅 250k 行的工作。我有哪些方法可以加快这个过程，从而减少花费太多时间？

假设我有如下数据集：

A  B  
1  "bright" 
2  "shiny"
3  "bright"

我想过滤行，所以我只得到B 列等于“亮”的行。我的代码用于选择themes 列至少等于值向量的值之一的行。

提前谢谢你。

【问题讨论】：

请发布一些示例输入和预期输出，r.e. stackoverflow.com/help/minimal-reproducible-example
没有数据很难确定，但似乎grepl 可以让您识别要包含的案例的行号而无需循环。
每次满足else 条件时，您都在编写整个tibble_to_analyse（少一行）。不要那样做。相反，标记您要删除的行并在循环完成后删除这些行。

标签： r dataframe loops for-loop if-statement

【解决方案1】：

另一个使用来自tidyverse的dplyr/string的方法：

library(tidyverse)
tibble_to_analyse %>% 
  filter(str_detect(themes, paste(themes_to_use, collapse = "|"))) # Edit, thank you @Jared_mamrot

示例数据：

set.seed(123)
n = 10000
tibble_to_analyse = tibble(
  val1 = sample(c(LETTERS, themes_to_use), n, replace = TRUE),
  val2 = sample(c(LETTERS, themes_to_use), n, replace = TRUE),
  themes = paste(val1, val2, sep = ";"),
  values = 1:n
)

速度提升很多，但不如 @Jared_marot 的基本 R 解决方案快。

【讨论】：

我不是 100% 确定，但我认为您需要使用 tibble_to_analyse %>% filter(str_detect(themes, paste(themes_to_use, collapse = "|"))) 来获得“正确”答案
谢谢，乔恩。您的解决方案解决了问题。另外，您可以根据@jared_mamrot 的评论更改您的解决方案吗？他是对的。

【解决方案2】：

如果没有一个最小的可重现示例，很难说这个解决方案是否合适，但您的循环花费这么长时间的原因之一是，对于循环的每次迭代，您都在编写 tibble 并计算 theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";"))。

如果您通过对函数进行矢量化来绕过这些问题，它应该会明显更快 - 这是一个示例：

library(tidyverse)

set.seed(123)

df <- data.frame(themes = sample(c("one;theme", "two;theme",
                               "three;theme", "four;theme"),
                             size = 10000, replace = TRUE),
              values = rnorm(10000))

themes_to_use <- c("one", "three")
                         
theme_analyser <- function(tibble_to_analyse) {
  
  for (i in 1:nrow(tibble_to_analyse)) {
    theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";"))
    if (any(theme %in% themes_to_use)){
      next}
    else {
      tibble_to_analyse <- tibble_to_analyse[-i,]
    }
  }  
}

vectorised_theme_analyser <- function(tibble_to_analyse) {
  tibble_to_analyse[which(gsub(";.*", "\\1", tibble_to_analyse$themes) %in% themes_to_use),]
}

res <- microbenchmark::microbenchmark(vectorised_theme_analyser(df),
                                      theme_analyser(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

^{由reprex package (v2.0.1) 于 2021-09-10 创建}

编辑

使用您问题中提供的简化示例数据，这是一个简化的子集方法并与@Jon Spring 的 tidy_detect 方法进行比较：

library(tidyverse)

set.seed(123)

df <- data.frame(themes = sample(c("one", "two",
                               "three", "four"),
                             size = 500000, replace = TRUE),
              values = rnorm(500000))

themes_to_use <- c("one", "three")

subset_in <- function(tibble_to_analyse) {
  tibble_to_analyse[tibble_to_analyse$themes %in% themes_to_use,]
}

tidy_detect <- function(tibble_to_analyse) {
  tibble_to_analyse %>% filter(str_detect(themes, paste(themes_to_use, collapse = "|")))
}

res <- microbenchmark::microbenchmark(subset_in(df),
                                      tidy_detect(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

^{由reprex package (v2.0.1) 于 2021-09-10 创建}

【讨论】：