【Question title】: What's the difference between the str_detect function in stringr and grepl and grep? [closed]
【Posted】: 2019-08-08 12:32:47
【Question】:

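I'm starting to do a lot of string matching in my work, and I'm curious what the differences between these three functions are, and in which situations someone would use one rather than another.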

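【Comments】:

Have you looked at the official documentation? rdocumentation.org/packages/stringr/versions/1.4.0/topics/…
My understanding is that they are very similar in terms of results. The stringr package, however, mainly provides consistent, user-friendly functions that are wrappers around the stringi package. As far as I know, these also tend to be faster.
I would start by digging into ?str_detect, ?grepl, ?grep, ?str_which, and ?match/%in%. And be sure to look at the stringr package documentation.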

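Tags: r stringr grepl

【Solution 1】:

stringr is "a consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi over base R is its speed, and stringr largely inherits it. For the functions compared here, the output of base and stringr is essentially the same.

I use stringi to generate some random text for the demonstration: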

library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)

grep returns the positions of the elements of the character vector that match the pattern, just like its stringr equivalent str_which:

grep("Lorem", sample_small)
#> [1]  1  9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1]  1  9 14 32 45 50 65 93 94

grepl/str_detect, on the other hand, tell you for every element of the vector whether or not it contains the pattern:

grepl("Lorem", sample_small)
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
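
As a minimal sketch of how the two pairs relate, the following uses a small made-up vector `x` (not part of the original answer). It checks that wrapping the logical result in which() gives the same positions that grep/str_which return, and it shows one edge case where base and stringr appear to differ: missing values.

```r
library(stringr)

# Hypothetical toy input, including a missing value
x <- c("Lorem ipsum", "dolor sit", NA, "amet Lorem")

# Positions of the matching elements
pos_base    <- grep("Lorem", x)
pos_stringr <- str_which(x, "Lorem")

# One logical value per element
hit_base    <- grepl("Lorem", x)
hit_stringr <- str_detect(x, "Lorem")

# which() turns the logical vectors into positions (NA is dropped)
same_base    <- identical(which(hit_base), pos_base)
same_stringr <- identical(which(hit_stringr), pos_stringr)

# Missing values: grepl() reports FALSE, str_detect() keeps NA
hit_base[3]
hit_stringr[3]
```

If missing values matter in your data, this difference is worth keeping in mind when switching between the two families.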

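The different shape of the result can matter in many cases. If I want to add a new column to a data.frame that records whether another column contains a pattern, I usually use grepl. grepl makes this easier because its result has the same length as the input variable: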

df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)

This way, somewhat more complex tests become possible:

which(df$lorem & df$ipsum)
#> [1]  1  5 15 53 71 75

Or directly as a filter rule:

library(dplyr)  # provides filter()
df %>% 
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))

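Now, as to why you would use stringr instead of base, I think there are two arguments. First, the different argument order makes stringr easier to use with pipes: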

library(dplyr)
sample_small %>% 
  str_detect("Lorem")

compared to:

sample_small %>% 
  grepl("Lorem", .)
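
The argument order is the whole point here, so a small sketch may help. The vector `x` is a made-up example, and the pipe is loaded from magrittr (an assumption for self-containedness; the answer loads dplyr, which provides the same operator). stringr takes the data first and the pattern second, while base takes the pattern first.

```r
library(stringr)
library(magrittr)  # source of the %>% pipe

# Hypothetical toy input
x <- c("Lorem ipsum", "dolor sit")

# stringr: string first, pattern second, so the piped value lands in the right slot
piped_stringr <- x %>% str_detect("Lorem")

# base: pattern first, so the piped value has to be placed explicitly with `.`
piped_base <- x %>% grepl("Lorem", .)

# Swapping the arguments still runs without error, but it searches for
# each element of x (used as a pattern) inside the string "Lorem"
swapped <- str_detect("Lorem", x)

piped_stringr
piped_base
swapped
```

Because the swapped call does not fail, mixing up the two conventions can go unnoticed: here it silently reports no matches at all.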

Second, stringr is about 5 times faster than base (for the two functions we are looking at):

sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
  base = grep("Lorem", sample_big),
  stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          674ms    674ms      1.48     415KB        0
#> 2 stringr       141ms    142ms      6.99     806KB        0


bench::mark(
  base = grepl("Lorem", sample_big),
  stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          679ms    679ms      1.47     391KB        0
#> 2 stringr       146ms    148ms      6.76     391KB        0

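The difference becomes even more pronounced when we look for a fixed (literal) match; by default, the pattern is treated as a regular expression: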

bench::mark(
  base = grepl("Lorem", sample_big, fixed = TRUE),
  stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          336ms  338.1ms      2.96     391KB        0
#> 2 stringr      12.4ms   12.6ms     79.1      417KB        0
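
Apart from speed, fixed matching also changes what counts as a match. A minimal sketch with a made-up vector `x` and a pattern containing a regex metacharacter (both chosen for illustration only):

```r
library(stringr)

# Hypothetical toy input: only the first element contains a literal dot
x <- c("a.b", "acb")

# As a regular expression, "." matches any single character
regex_base    <- grepl("a.b", x)
regex_stringr <- str_detect(x, "a.b")

# As a fixed string, "." only matches a literal dot
fixed_base    <- grepl("a.b", x, fixed = TRUE)
fixed_stringr <- str_detect(x, fixed("a.b"))

regex_base
fixed_base
```

For a pattern like "Lorem", which contains no metacharacters, both modes find the same elements, so there only the speed differs.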

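Still, I think the base functions have a certain charm, which is why I often use them when I am writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels a bit awkward to me. Other examples are the option value = TRUE in grep (I'll let you figure that one out yourself) and, finally, ignore.case = TRUE, which looks a bit awkward in stringr: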

str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#>  [1]  1  5  6  8  9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97

However, the reason this feels awkward to me may simply be that I used base R before I learned stringr.
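
For readers who do not want to figure out value = TRUE on their own, here is a short sketch placing the base options next to what I take to be their closest stringr counterparts. The vector `x` is made up, and pairing grep(value = TRUE) with str_subset is my reading, not something the answer states.

```r
library(stringr)

# Hypothetical toy input
x <- c("Lorem ipsum", "dolor sit", "amet lorem")

# value = TRUE returns the matching elements instead of their positions
val_base    <- grep("Lorem", x, value = TRUE)
val_stringr <- str_subset(x, "Lorem")

# Case-insensitive matching: an argument in base, a regex() modifier in stringr
ci_base    <- grep("lorem", x, ignore.case = TRUE)
ci_stringr <- str_which(x, regex("lorem", ignore_case = TRUE))

val_base
ci_base
```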

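Another point to consider is that stringi offers more functionality overall. So if you are determined to get into string manipulation, you might want to start learning that package right away, although there are fewer tutorials and some things may be harder to figure out.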

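【Comments】:

Thank you, this was very helpful and gave me a lot to think about and read up on going forward!
I can't reproduce your benchmarks. For str_detect and str_which, I see no significant difference between them and their base counterparts. Using fixed() is faster than fixed = TRUE; but setting perl = TRUE in base is much faster than any of the stringr versions. On Windows, R 4.0.3.
That's odd. I just repeated the benchmarks and got almost the same results with R 4.0.3 and stringr 1.4.0. I also tested it on rstudio.cloud in case my computer was doing something weird. I can confirm that perl = TRUE changes the picture. I have never really used it. Maybe there is a hidden trade-off?
In the first part of your post, including the first code block, you write about the packages stringi and stringr, but the way it is written now it sounds as if you might be mixing the two packages up.
I added a small clarification, but I'm not sure what you mean. stringr is built on top of stringi, not the other way around...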