Vectorization of matrix operation in R matching string patterns: answers

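【Question title】: Vectorization of matrix operation in R matching string patterns
【Posted】: 2019-04-02 10:29:21
【Question】:

I am using the code below to create a matrix that compares every string in one vector against every pattern in a second vector, counting how many times each pattern occurs in each string: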

library(stringr)  # provides str_count()

strngs <- c("hello there", "welcome", "how are you")
pattern <- c("h", "e", "o")

M <- matrix(nrow = length(strngs), ncol = length(pattern))

# Count occurrences of pattern j in string i, one cell at a time
for(i in seq_along(strngs)){
  for(j in seq_along(pattern)){
    M[i, j] <- str_count(strngs[i], pattern[j])
  }
}

M

This works well and returns the matrix I am looking for:

     [,1] [,2] [,3]
[1,]    2    3    1
[2,]    0    2    1
[3,]    1    1    2

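However, my real data set is huge, and a loop like this does not scale well to a matrix with 117,746,754 values. Does anyone know a way to vectorize it or otherwise speed it up? Or should I just learn C++? ;)

Thanks!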

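【Comments】:

For a start, you should be able to get some speedup by using stringi::stri_count_fixed() instead of stringr::str_count()
That's great, thanks. Out of curiosity, how did you know that? Just trying to learn.
Since your question is about speed, it would be good if you provided vectors of a relevant size (in addition to the current small example, which does show the expected result nicely!), e.g. x <- rep(strngs, 1e6) (and more patterns?), along with some idea of the time required. Cheers

Tags: r loops matrix vectorization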
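
The swap suggested in the first comment also changes matching semantics, which is worth keeping in mind: stringr::str_count() treats its pattern as a regular expression by default, while stringi::stri_count_fixed() matches the pattern literally. For the plain letters in the question the two agree, but they diverge once a pattern contains a regex metacharacter. A minimal sketch using stringi's own regex counterpart, with a made-up input string `x` that is not part of the question:

```r
library(stringi)

x <- "a.b.c"  # hypothetical input, chosen to contain a regex metacharacter

# Regex matching: "." matches any character, so every character is counted
regex_count <- stri_count_regex(x, ".")

# Fixed matching: "." is a literal dot, so only the two dots are counted
fixed_count <- stri_count_fixed(x, ".")

regex_count  # 5
fixed_count  # 2
```

If the real patterns are plain words, as described later in the thread, fixed matching should give the same counts as the regex version.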

【Solution 1】:

You can use outer together with stri_count_fixed, as suggested by @snoram.

outer(strngs, pattern, stringi::stri_count_fixed)
#     [,1] [,2] [,3]
#[1,]    2    3    1
#[2,]    0    2    1
#[3,]    1    1    2
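
To see why this works, it may help to spell out roughly what outer does: it repeats both inputs so that every string is paired with every pattern, calls the function once on the two long vectors, and reshapes the result. This only works because stri_count_fixed is vectorized over both arguments. The sketch below reproduces that by hand; the names `s`, `p`, `counts` and `M_manual` are illustrative only.

```r
library(stringi)

strngs  <- c("hello there", "welcome", "how are you")
pattern <- c("h", "e", "o")

# Pair every string with every pattern, in column-major order
s <- rep(strngs, times = length(pattern))  # strings cycle fastest
p <- rep(pattern, each = length(strngs))   # each pattern repeated once per string

# One vectorized call over all length(strngs) * length(pattern) pairs
counts <- stri_count_fixed(s, p)

# Reshape the flat result into a strings-by-patterns matrix
M_manual <- matrix(counts, nrow = length(strngs), ncol = length(pattern))

# Compare with the outer() result
all(M_manual == outer(strngs, pattern, stri_count_fixed))
```

A function that is not vectorized in this way would need wrapping (for example with Vectorize) before it could be passed to outer.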

【Comments】:

【Solution 2】:

Removing the inner loop and switching to stringi (which stringr is built on) gives a marginal improvement.

M <- matrix(0L, nrow = length(strngs), ncol = length(pattern))
for(i in 1:length(strngs)) {
  M[i, ] <- stringi::stri_count_fixed(strngs[i], pattern)
}

And then a more idiomatic R way:

t(sapply(strngs, stringi::stri_count_fixed, pattern))

【Comments】:

【Solution 3】:

Another solution, using sapply. It is basically snoram's solution.

t(sapply(strngs, stringi::stri_count_fixed, pattern))
#            [,1] [,2] [,3]
#hello there    2    3    1
#welcome        0    2    1
#how are you    1    1    2
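
The row names in the output above appear because sapply labels its result with the input strings when given a character vector (USE.NAMES = TRUE is the default). If a plain matrix like the one in the question is wanted, two possible ways to drop the names are sketched below; `M_plain` and `M_unnamed` are illustrative names.

```r
library(stringi)

strngs  <- c("hello there", "welcome", "how are you")
pattern <- c("h", "e", "o")

# USE.NAMES = FALSE stops sapply from labelling results with the input strings
M_plain <- t(sapply(strngs, stri_count_fixed, pattern, USE.NAMES = FALSE))

# Alternatively, strip the dimnames after the fact
M_unnamed <- unname(t(sapply(strngs, stri_count_fixed, pattern)))

is.null(dimnames(M_plain))
is.null(dimnames(M_unnamed))
```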

Tests.

Since there are now 4 approaches in total, here are some speed tests.

f0 <- function(){
  M<-matrix(nrow=length(strngs),ncol=length(pattern))
  for(i in 1:length(strngs)){
    for(j in 1:length(pattern)){
      M[i,j]<-stringr::str_count(strngs[i],pattern[j])
    }
  }
  M
}

f1 <- function(){
  M <- matrix(0L, nrow = length(strngs), ncol = length(pattern))
  for(i in 1:length(strngs)) {
    M[i, ] <- stringi::stri_count_fixed(strngs[i], pattern)
  }
  M
}

f2 <- function() outer(strngs, pattern, stringi::stri_count_fixed)

f3 <- function() t(sapply(strngs, stringi::stri_count_fixed, pattern))

r0 <- f0()
r1 <- f1()
r2 <- f2()
r3 <- f3()

identical(r0, r1)
identical(r0, r2)
identical(r0, r3)  # FALSE, the return has rownames


library(microbenchmark)
library(ggplot2)

mb <- microbenchmark(
  op = f0(),
  snoram = f1(),
  markus = f2(),
  rui = f3()
)

mb
#Unit: microseconds
#   expr     min       lq      mean   median       uq     max
#     op 333.425 338.8705 348.23310 341.7700 345.8060 542.699
# snoram  47.923  50.8250  53.96677  54.8500  56.3870  69.903
# markus  27.502  29.8005  33.17537  34.3670  35.7490  54.095
#    rui  68.994  72.3020  76.77452  73.4845  77.1825 215.328

autoplot(mb)

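【Comments】:

Don't we need bigger data for the test to be representative?
@snoram Maybe. Sometimes (often?) code runs at different relative speeds on small and large data sets. The best solution for one problem size is not always the best for another.
The actual pattern vector has 9000 values (basically the 9000 most common English words), and the actual strngs vector has 13000 phrases. So rep(pattern, 3000) and rep(strngs, 4300) would produce a matrix of similar size.
Right. But really, I see no reason why Markus's solution would not keep outperforming the other solutions that have appeared so far.
Me too. It performs the operation in about 30 seconds, a world of difference from the 20 minutes my double loop took.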
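
For anyone wanting to check how the approaches scale, the comparison can be repeated on enlarged inputs along the lines suggested in the comments. The sketch below is only an illustration: the repetition factors are arbitrary and far smaller than the real problem, the names `big_strngs`, `big_pattern`, `t_loop` and `t_outer` are made up here, and timings will vary by machine, so none are shown.

```r
library(stringi)

strngs  <- c("hello there", "welcome", "how are you")
pattern <- c("h", "e", "o")

# Scaled-up inputs; factors are arbitrary and well below the
# rep(pattern, 3000) / rep(strngs, 4300) mentioned in the comments
big_strngs  <- rep(strngs, 400)   # 1200 strings
big_pattern <- rep(pattern, 300)  # 900 patterns

# Row-wise loop (Solution 2 style)
t_loop <- system.time({
  M_loop <- matrix(0L, nrow = length(big_strngs), ncol = length(big_pattern))
  for (i in seq_along(big_strngs)) {
    M_loop[i, ] <- stri_count_fixed(big_strngs[i], big_pattern)
  }
})

# outer (Solution 1 style)
t_outer <- system.time(
  M_outer <- outer(big_strngs, big_pattern, stri_count_fixed)
)

# Both approaches should agree before their timings are compared
stopifnot(all(M_loop == M_outer))

# Elapsed seconds for each approach
c(loop = t_loop[["elapsed"]], outer = t_outer[["elapsed"]])
```

Note that outer replicates both inputs to length(X) * length(Y) before calling the function, so at the full problem size memory use may matter as much as speed.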