【Question title】: Faster way to capture regex
【Posted】: 2015-02-13 20:18:08
【Question】:

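I want to use a regular expression to capture substrings. I already have a working solution, but I wonder whether there is a faster one. I am applying applyCaptureRegex to a vector with about 400,000 entries.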

 exampleData <- as.data.frame(c("[hg19:21:34809787-34809808:+]","[hg19:11:105851118-105851139:+]","[hg19:17:7482245-7482266:+]","[hg19:6:19839915-19839936:+]"))

captureRegex <- function(captRegEx,str){
  sapply(regmatches(str,gregexpr(captRegEx,str))[[1]], function(m) regmatches(m,regexec(captRegEx,m)))
}

applyCaptureRegex <- function(mir,r){
  mir <- unlist(apply(mir, 1, function(x) captureRegex(r,x[1])))
  mir <- matrix(mir ,ncol=5, byrow = TRUE)
  mir
}

Usage and results:

> captureRegex("\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]","[hg19:12:125627828-125627847:-]")
$`[hg19:12:125627828-125627847:-]`
[1] "[hg19:12:125627828-125627847:-]" "12" "125627828" "125627847" "-"   

> applyCaptureRegex(exampleData,"\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]")
     [,1]                              [,2] [,3]        [,4]        [,5]
[1,] "[hg19:21:34809787-34809808:+]"   "21" "34809787"  "34809808"  "+" 
[2,] "[hg19:11:105851118-105851139:+]" "11" "105851118" "105851139" "+" 
[3,] "[hg19:17:7482245-7482266:+]"     "17" "7482245"   "7482266"   "+" 
[4,] "[hg19:6:19839915-19839936:+]"    "6"  "19839915"  "19839936"  "+"

Thanks!

【Comments on the question】:

Why are you doing sapply(regmatches(str,gregexpr(captRegEx,str))[[1]], function(m) regmatches(m,regexec(captRegEx,m))) in captureRegex, and not just regmatches(str,regexec(captRegEx,str))[[1]]?
I couldn't see the forest for the trees ;)
OK, so just change applyCaptureRegex to the following one-liner and you will get a big boost: do.call(rbind,lapply(charvecHere,function(x) regmatches(x,regexec(regularExpHere,x))[[1]])) (even though the str_match approach is still about 4 times faster...)
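
The one-liner from the last comment can probably be shortened further, because regexec and regmatches are both vectorised over their input, so the lapply is not strictly needed. A minimal base-R sketch; the names `pattern`, `x` and `m` are only illustrative, and this variant was not timed anywhere in this thread:

```r
pattern <- "\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]"
x <- c("[hg19:21:34809787-34809808:+]",
       "[hg19:12:125627828-125627847:-]")

# regexec() returns one match object per element of x;
# regmatches() turns each into c(full match, group 1, ..., group 4).
# do.call(rbind, ...) stacks those vectors into a character matrix.
m <- do.call(rbind, regmatches(x, regexec(pattern, x)))
m
```

An element that does not match yields an empty vector, so with messy input the rows may not line up one-to-one with `x`.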

Tags: regex r performance apply

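【Solution 1】:

Why reinvent the wheel? There are several packages to choose from, each with a function that returns a character matrix with one column for each capture group in the pattern.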

stri_match_all_regex — stringi

library(stringi)

x <- c('[hg19:21:34809787-34809808:+]', '[hg19:11:105851118-105851139:+]', '[hg19:17:7482245-7482266:+]', '[hg19:6:19839915-19839936:+]')
do.call(rbind, stri_match_all_regex(x, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]'))
#      [,1]                              [,2] [,3]        [,4]        [,5]
# [1,] "[hg19:21:34809787-34809808:+]"   "21" "34809787"  "34809808"  "+" 
# [2,] "[hg19:11:105851118-105851139:+]" "11" "105851118" "105851139" "+" 
# [3,] "[hg19:17:7482245-7482266:+]"     "17" "7482245"   "7482266"   "+" 
# [4,] "[hg19:6:19839915-19839936:+]"    "6"  "19839915"  "19839936"  "+"
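
Since each string here contains exactly one match, stringi's stri_match_first_regex may be a more direct fit: it returns a single matrix, so no do.call(rbind, ...) is needed. A sketch under that one-match-per-element assumption; this variant is not part of the benchmark in this answer, and the closing bracket is escaped here only to be on the safe side:

```r
library(stringi)

x <- c('[hg19:21:34809787-34809808:+]', '[hg19:6:19839915-19839936:+]')

# One row per input string; the columns are the full match
# followed by the four capture groups.
m <- stri_match_first_regex(x, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])\\]')
m
```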

str_match — stringr

library(stringr)
str_match(x, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')

strapplyc — gsubfn

library(gsubfn)
strapplyc(x, "(\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])])", simplify = rbind)

Here is a benchmark comparing all of the solutions above.

x <- rep(c('[hg19:21:34809787-34809808:+]', 
           '[hg19:11:105851118-105851139:+]', 
           '[hg19:17:7482245-7482266:+]', 
           '[hg19:6:19839915-19839936:+]'), 1000)

applyCaptureRegex <- function(mir, r) {
  do.call(rbind, lapply(mir, function(x) regmatches(x, regexec(r, x))[[1]]))
}

gsubfn <- function(x1) strapplyc(x1, '(\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])])', simplify = rbind)
regmtch <- function(x1) applyCaptureRegex(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')
stringr <- function(x1) str_match(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')
stringi <- function(x1) do.call(rbind, stri_match_all_regex(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]'))

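# the wrapped functions come from these three packages
library(gsubfn); library(stringr); library(stringi)
require(microbenchmark)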
microbenchmark(gsubfn(x), regmtch(x), stringr(x), stringi(x))

Results

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
  gsubfn(x) 372.27072 382.82179 391.21837 388.32396 396.27361 449.03091   100
 regmtch(x) 394.03164 409.87523 419.42936 417.76770 427.08208 456.92460   100
 stringr(x)  65.81644  70.28327  76.02298  75.43162  78.92567 116.18026   100
 stringi(x)  15.88171  16.53047  17.52434  16.96127  17.76007  23.94449   100

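【Comments】:

I'm curious how this compares, performance-wise, with just doing strsplit(as.character(exampleData[[1]]),split = ":|-|]") and then stitching the pieces together with rbindlist from data.table.
If I had known about this earlier, it would have saved me quite some time. It really is much faster. Perfect! Thanks!
@MineSweeper I thought rbindlist would coerce things as needed, but apparently not. Still, my cursory checking suggests that even just using do.call("rbind",...) is about 40% faster than str_match, if you care.
@joran The faster the better. I have never used do.call in my projects before. Maybe now is the time.
@joran: The strsplit approach is quite clever and works for this case, but if you need a more complex regex to tokenize your strings, it may not be applicable...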
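
For reference, the strsplit idea from these comments might look roughly like the sketch below. It is base R only; `x`, `parts` and `m` are illustrative names, and no timing is claimed here. One caveat that follows from the split pattern: because `-` is treated as a delimiter, a minus-strand sign would be consumed as well, so this only suits input like the plus-strand example data.

```r
x <- c("[hg19:21:34809787-34809808:+]",
       "[hg19:6:19839915-19839936:+]")

# Split on ':', '-' or ']' (the bracket is escaped to be safe).
# Each element becomes c("[hg19", chromosome, start, end, strand).
parts <- strsplit(x, split = ":|-|\\]")

# Stack the pieces into a character matrix, one row per input string.
m <- do.call(rbind, parts)
m
```

Unlike the capture-group solutions, the first column here holds the leading "[hg19" token rather than the full match.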