Regular expression taking too much time to compile in R

【Question title】: Regular expression taking too much time to compile in R
【Posted】: 2015-07-28 09:32:34
【Question description】:

I read a file into ratingsFile using

ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")

The first few lines of the file look like this:

  0000000125  1478759   9.2  The Shawshank Redemption (1994)
  0000000125  1014575   9.2  The Godfather (1972)
  0000000124  683611   9.0  The Godfather: Part II (1974)
  0000000124  1451861   8.9  The Dark Knight (2008)
  0000000124  1150611   8.9  Pulp Fiction (1994)
  0000000133  750978   8.9  Schindler's List (1993)

Using regular expressions, I extracted the tokens and the ratings:

  match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
  match <- regmatches(ratingsFile,match)


  next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
  next_match <- regmatches(ratingsFile,next_match)

A sample element of match looks like this:

  "0000000125" "1014575"    "9.2"        "The"        "Godfather"  "1972"

To clean this data and get it into the form I need, I run:

  movies_name <- character(0)
  rating <- character(0)
  for(i in 1:length(match)){

      match[[i]]<-match[[i]][-1:-3] # remove the first three columns, which are not needed
      len <- length(match[[i]])
      match[[i]]<-match[[i]][-len] # remove the last column, which is also not needed
      movies_name<-append(movies_name,paste(match[[i]],collapse =" ")) # append the movie name
      rating <- append(rating,next_match[[i]]) # append the rating
  }

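This last block of code takes far too long to execute. I left it running for several hours and it still had not finished; the file is 636497 lines long.

How can I reduce the execution time in this case?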

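【Comments】:

Can you describe precisely what you are trying to do? (Rather than making us guess by reading your code.)
I want to reduce the execution time of the last block of code.
I figured that much, but can you tell us what that block does and, better still, what exactly you want to achieve, so that we can show you how to do it more efficiently? Maybe just by speeding up the last loop, but more likely by replacing the last loop with something more efficient; possibly the earlier steps can be folded into it as well...
Sir, I have made some edits.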

Tags: regex r time-complexity text-mining

【Solution 1】:

Try this:

ratingsFile <- readLines(n = 6)
0000000125  1478759   9.2  The Shawshank Redemption (1994)
0000000125  1014575   9.2  The Godfather (1972)
0000000124  683611   9.0  The Godfather: Part II (1974)
0000000124  1451861   8.9  The Dark Knight (2008)
0000000124  1150611   8.9  Pulp Fiction (1994)
0000000133  750978   8.9  Schindler's List (1993)
setNames(
  as.data.frame(
    t(sapply(
      regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile)),
      "[", -1  # drop the full match, keep the two capture groups
    ))
  ),
  c("rating", "movie_name")
)
#   rating               movie_name
# 1    9.2 The Shawshank Redemption
# 2    9.2            The Godfather
# 3    9.0   The Godfather: Part II
# 4    8.9          The Dark Knight
# 5    8.9             Pulp Fiction
# 6    8.9         Schindler's List
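
For context on why the loop in the question is slow: append() copies the whole result vector on every iteration, so the total work grows roughly quadratically with the number of lines. The following is a minimal sketch, not part of the original answer, that keeps the asker's own tokenization and logic but builds each result in a single pass. The names lines and tokens are illustrative, and it assumes every line contains a rating.

```r
# Sample lines in the same layout as the question's ratings file
lines <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000125  1014575   9.2  The Godfather (1972)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Same tokenization as in the question
tokens <- regmatches(lines, gregexpr("[0-9A-Za-z;'$%&?@./]+", lines))

# Drop the first three tokens and the last one, then join the rest.
# vapply() allocates the result once instead of growing it with append().
movies_name <- vapply(
  tokens,
  function(tok) paste(tok[-c(1:3, length(tok))], collapse = " "),
  character(1)
)

# regexpr() finds only the first match per line, so regmatches() returns a
# plain character vector here (lines without a match would be dropped).
rating <- regmatches(lines, regexpr("[0-9][.][0-9]+", lines))
```

Note that the question's character class does not include ":", so a title such as "The Godfather: Part II" loses its colon with this tokenization; the regexec approach above does not have that limitation.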

【Discussion】:

【Solution 2】:

If I understand correctly what you want to do (get only the movie titles), here is another way to get it:

unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
                                 function(x){ # for each resulting vector
                                    x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
                                    x
                                 }))

# [1] "The Shawshank Redemption" "The Godfather"            "The Godfather: Part II"   "The Dark Knight"          "Pulp Fiction"            
# [6] "Schindler's List"

Note: you can put the resulting vector in a data.frame and/or keep the other information that precedes the title on each line.
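
As a rough illustration of that note, the sketch below keeps all the fields and collects them in a data.frame. It is not part of the original answer: the column names col1, col2 and year are placeholders, and it assumes that every line splits into exactly four fields (that is, no title contains two consecutive spaces).

```r
lines <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Trim surrounding whitespace, then split on runs of two or more spaces
fields <- strsplit(trimws(lines), "\\s{2,}")

# Assumption: exactly four fields per line
stopifnot(all(lengths(fields) == 4))

# Character matrix with one row per input line
m <- do.call(rbind, fields)

ratings <- data.frame(
  col1       = m[, 1],                                   # first column, kept as text
  col2       = m[, 2],                                   # second column, kept as text
  rating     = as.numeric(m[, 3]),
  movie_name = sub(" \\(\\d{4}\\)$", "", m[, 4]),        # strip the trailing " (year)"
  year       = as.integer(sub("^.* \\((\\d{4})\\)$", "\\1", m[, 4])),
  stringsAsFactors = FALSE
)
```

Because strsplit(), sub() and do.call(rbind, ...) each process all lines at once, no result vector is grown inside a loop.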

【Discussion】:

【Solution 3】:

If you want to find and use specific pieces of your data, I think you can use this regex:

/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm

with these capture groups:

$1 => first column
$2 => second column
$3 => third column (probably the rating)
$4 => movie name
$5 => movie year

[Regex Demo]
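
The pattern above is written in JavaScript-style notation. A possible way to apply it in R is sketched below; this is an assumption about usage, not part of the original answer. The backslashes are doubled for an R string, the /gm flags are dropped because each vector element is already a single line, and perl = TRUE is passed to regexec() (available in recent R versions). The names pattern, parts and result are illustrative.

```r
lines <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Same pattern as above, with backslashes escaped for an R string
pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

# Each element holds the full match followed by the five capture groups;
# lines that do not match yield character(0)
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Keep only the lines that matched, then stack them into a matrix
ok <- lengths(m) == 6
parts <- do.call(rbind, m[ok])

result <- data.frame(
  rating     = as.numeric(parts[, 4]),   # $3
  movie_name = trimws(parts[, 5]),       # $4 still carries surrounding spaces
  year       = as.integer(parts[, 6]),   # $5
  stringsAsFactors = FALSE
)
```

Group $4 captures the spaces on either side of the title, which is why the sketch passes it through trimws().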

【Discussion】: