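【Posted】: 2015-02-13 20:18:08
【Question】:
I want to use a regular expression to capture substrings. I already have a working solution, but I would like to know whether there is a faster one. I apply applyCaptureRegex to a vector of about 400,000 entries.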
exampleData <- as.data.frame(c("[hg19:21:34809787-34809808:+]","[hg19:11:105851118-105851139:+]","[hg19:17:7482245-7482266:+]","[hg19:6:19839915-19839936:+]"))
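# Return the full match and its capture groups for every match of captRegEx in str.
captureRegex <- function(captRegEx, str) {
  sapply(regmatches(str, gregexpr(captRegEx, str))[[1]],
         function(m) regmatches(m, regexec(captRegEx, m)))
}
# Run captureRegex on the first column of every row and collect the results
# in a matrix: one row per entry, holding the full match plus four capture groups.
applyCaptureRegex <- function(mir, r) {
  mir <- unlist(apply(mir, 1, function(x) captureRegex(r, x[1])))
  mir <- matrix(mir, ncol = 5, byrow = TRUE)
  mir
}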
Usage and results:
> captureRegex("\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]","[hg19:12:125627828-125627847:-]")
$`[hg19:12:125627828-125627847:-]`
[1] "[hg19:12:125627828-125627847:-]" "12" "125627828" "125627847" "-"
> applyCaptureRegex(exampleData,"\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]")
[,1] [,2] [,3] [,4] [,5]
[1,] "[hg19:21:34809787-34809808:+]" "21" "34809787" "34809808" "+"
[2,] "[hg19:11:105851118-105851139:+]" "11" "105851118" "105851139" "+"
[3,] "[hg19:17:7482245-7482266:+]" "17" "7482245" "7482266" "+"
[4,] "[hg19:6:19839915-19839936:+]" "6" "19839915" "19839936" "+"
Thanks!
【Comments】:
-
Why do you do this in captureRegex:
sapply(regmatches(str,gregexpr(captRegEx,str))[[1]], function(m) regmatches(m,regexec(captRegEx,m)))
instead of just:
regmatches(str,regexec(captRegEx,str))[[1]]?
-
I couldn't see the forest for the trees ;)
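The simplification suggested in the comment above can be sketched as follows. The name captureRegexSimple is hypothetical and used only for illustration, and the sketch assumes each input string contains exactly one match, as in the example data.

```r
# Simplified single-string version (hypothetical name captureRegexSimple).
# regexec() already reports the full match together with every capture group,
# so the gregexpr()/sapply() round trip is not needed when there is one match.
captureRegexSimple <- function(captRegEx, str) {
  regmatches(str, regexec(captRegEx, str))[[1]]
}

pattern <- "\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]"
res <- captureRegexSimple(pattern, "[hg19:12:125627828-125627847:-]")
print(res)
```

The result is a plain character vector of length 5 (full match, then the four groups), rather than the named list that the original captureRegex returns.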
-
OK, so just change applyCaptureRegex to the following one-liner and you will get a big speedup:
do.call(rbind,lapply(charvecHere,function(x) regmatches(x,regexec(regularExpHere,x))[[1]]))
(even though the str_match approach is still about 4 times faster...)
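A minimal sketch of that one-liner on the example data, assuming the input is a plain character vector (here called coords, a name introduced only for this example). Because regexec() and regmatches() both accept a whole character vector, the lapply() can also be dropped; that second variant is an addition for illustration, not something stated in the thread, and no timings are claimed for it.

```r
pattern <- "\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]"
coords <- c("[hg19:21:34809787-34809808:+]",
            "[hg19:11:105851118-105851139:+]",
            "[hg19:17:7482245-7482266:+]",
            "[hg19:6:19839915-19839936:+]")

# One-liner from the comment: one regexec() call per element.
m1 <- do.call(rbind, lapply(coords, function(x) {
  regmatches(x, regexec(pattern, x))[[1]]
}))

# Vectorized variant (assumption: every element matches the pattern):
# regexec() and regmatches() process the whole vector in a single call.
m2 <- do.call(rbind, regmatches(coords, regexec(pattern, coords)))

print(m2)
```

Both variants assume every entry matches. An entry that does not match yields a zero-length vector, which rbind() silently drops, so the rows would no longer line up with the input.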
Tags: regex r performance apply