Optimise sapply-grepl on a large dataset: answers

【Question title】: Optimise sapply-grepl on a large dataset
【Posted】: 2019-09-24 20:12:48
【Question】:

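I have a large dataset (df) of ~250,000 observations that includes a cleanText column (text that has been stripped of digits, punctuation, uppercase letters, etc.), and I have a list of company names. For each observation in df$cleanText, I want to check whether it matches any company name in the list, count the number of matches found, and store that count. My code works, but it takes about 20 hours to run, and I feel it could be much faster.

So far I have not been able to figure out what would work.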

# Start for loop for each row in df
for(i in 1:nrow(df)){

# store matches in companyNameMatch, make sure the paste0 includes \\b to match whole strings
companyNameMatch <- sapply(list_Companies, function(x) grepl(paste0(x, "\\b"), as.character(df$cleanText[i])))

# Calculate the number of matches and store it
df$companyNameMatch[i] <- as.numeric(length(which(companyNameMatch != 0)))
}

I would expect the code to be able to run within a few hours or so.

Example

cleanText <- c("keeping a cool head takes practice nike",
               "playing soccer on these adidas",
               "just having a laugh",
               "nike and adidas perform better than crocs")

list_Companies <- c("nike", "adidas", "crocs", "puma")

For each row in df$cleanText, the sapply function should check whether the text matches an entry in list_Companies. The result in this case would look like this:

df$companyNameMatch[1] = 1
df$companyNameMatch[2] = 1
df$companyNameMatch[3] = 0
df$companyNameMatch[4] = 3

【Comments】:

Please share a minimal reproducible example and show the expected output.
Added a short example.

Tags: r regex grepl

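【Solution 1】:

With base R, we can loop over list_Companies with grepl and then use Reduce to collapse the resulting list of logical vectors into a single vector of counts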

Reduce(`+`, lapply(list_Companies, grepl, cleanText))
#[1] 1 1 0 3
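To make the two steps visible, here is a minimal self-contained sketch using the example data from the question. The intermediate names hits and counts are introduced here purely for illustration and are not part of the original answer.

```r
cleanText <- c("keeping a cool head takes practice nike",
               "playing soccer on these adidas",
               "just having a laugh",
               "nike and adidas perform better than crocs")
list_Companies <- c("nike", "adidas", "crocs", "puma")

# Step 1: one logical vector per company, each as long as cleanText.
# grepl is vectorised over its second argument, so this needs only
# length(list_Companies) calls rather than one call per row.
hits <- lapply(list_Companies, grepl, cleanText)

# Step 2: element-wise addition coerces TRUE/FALSE to 1/0,
# so the running sum is the number of matching companies per text.
counts <- Reduce(`+`, hits)
counts
```

The gain over the row-by-row loop presumably comes from calling grepl once per company instead of once per company per row.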

Or a similar option with tidyverse

library(tidyverse)
map(list_Companies, str_detect, string = cleanText) %>%
  reduce(`+`)

【Comments】:

reduce also seems to work perfectly, thanks a lot!

【Solution 2】:

You can use sapply together with rowSums

df$companyNameMatch <- rowSums(sapply(list_Companies, function(x) grepl(x, cleanText)))

Using the microbenchmark package, we can see that this improves speed considerably:

Unit: microseconds
     expr      min       lq      mean   median       uq        max neval cld
  rowSums   65.382   78.496   132.345   93.511   119.55   1462.727   100  a 
 for_loop 6206.326 6920.394 11170.353 7340.814 10058.53 170440.373   100   b
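The benchmark code itself is not shown in the answer. As a rough sketch of how a similar comparison could be reproduced with base R only, the snippet below uses system.time instead of microbenchmark. The helper names count_loop and count_rowsums and the replicated input bigText are assumptions made for this illustration, and the timings will differ from the table above.

```r
cleanText <- c("keeping a cool head takes practice nike",
               "playing soccer on these adidas",
               "just having a laugh",
               "nike and adidas perform better than crocs")
list_Companies <- c("nike", "adidas", "crocs", "puma")

# Hypothetical larger input: the four example texts repeated 500 times
bigText <- rep(cleanText, times = 500)

# Row-by-row version, modelled on the loop in the question
count_loop <- function(texts, companies) {
  out <- numeric(length(texts))
  for (i in seq_along(texts)) {
    out[i] <- sum(sapply(companies, function(x) grepl(x, texts[i])))
  }
  out
}

# Vectorised version: one grepl call per company over all texts
count_rowsums <- function(texts, companies) {
  rowSums(sapply(companies, function(x) grepl(x, texts)))
}

system.time(res_loop <- count_loop(bigText, list_Companies))
system.time(res_vec <- count_rowsums(bigText, list_Companies))

# Both approaches should produce the same counts
all(res_loop == res_vec)
```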

【Comments】:

Using paste0(x, "\\b") in place of "x" inside grepl did the trick for me, and it does indeed seem a lot quicker! Thanks
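A short sketch of what that comment describes, combining the rowSums answer with the trailing word boundary from the question. The fifth text ("pumas are big cats") is not in the original example; it is added here only to show where the two patterns differ, and the names plain and bounded are likewise illustrative.

```r
cleanText <- c("keeping a cool head takes practice nike",
               "playing soccer on these adidas",
               "just having a laugh",
               "nike and adidas perform better than crocs",
               "pumas are big cats")  # extra row, added for illustration
list_Companies <- c("nike", "adidas", "crocs", "puma")

# Plain substring match: "puma" is found inside "pumas"
plain <- rowSums(sapply(list_Companies, function(x) grepl(x, cleanText)))

# Trailing word boundary, as in the question: "pumas" no longer counts
bounded <- rowSums(sapply(list_Companies,
                          function(x) grepl(paste0(x, "\\b"), cleanText)))

plain    # 1 1 0 3 1
bounded  # 1 1 0 3 0
```

Note that sapply only returns a matrix here because cleanText has more than one element. With a single text it would return a plain vector and rowSums would fail, so that case would need separate handling.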