(Fast) word frequency matrix in R: answers

【Question title】: (Fast) word frequency matrix in R
【Posted】: 2014-06-22 03:29:59
【Question】:

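I am writing an R program that analyzes a large amount of unstructured text data and creates word frequency matrices. I have been using the wfm and wfdf functions from the qdap package, but they are somewhat slow for my needs. Producing the word frequency matrix appears to be the bottleneck.

The code for my function is below.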

library(qdap)
liwcr <- function(inputText, dict) {
  if(!file.exists(dict)) 
    stop("Dictionary file does not exist.")

  # Read in dictionary categories
  # Start by figuring out where the category list begins and ends
  dictionaryText <- readLines(dict)
  if(!length(grep("%", dictionaryText))==2)
    stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

  catStart <- grep("%", dictionaryText)[1]
  catStop <- grep("%", dictionaryText)[2]
  dictLength <- length(dictionaryText)

  dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))

  wordCount <- word_count(inputText)

  outputFrame <- dictionaryCategories
  outputFrame["count"] <- 0

  # Now read in dictionary words

  no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
  dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)

  workingMatrix <- wfdf(inputText)
  for (i in workingMatrix[,1]) {
    if (i %in% dictionaryWords[, 1]) {
      occurrences <- 0
      foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
      foundCategories <- foundWord[1,2:no_col]
      for (w in foundCategories) {
        if (!is.na(w) & (!w=="")) {
          existingCount <- outputFrame[outputFrame$V1 == w,]$count
          outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
        }
      }
    }
  }
  return(outputFrame)
}

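I realize the for loop is inefficient, so to locate the bottleneck I tested the code without that part (simply reading in each text file and producing the word frequency matrix) and saw very little improvement in speed. Example: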

library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
    out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
    cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

for(i in 1:length(filename)){
  print(filename[i])
  text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
  freq <- wfm(text)
}

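The input files are Twitter and Facebook status posts.

Is there any way to improve the speed of this code?

EDIT2: Due to institutional restrictions, I cannot post any of the raw data. To give an idea of what I am dealing with: 25k text files, each containing all available tweets from a single Twitter user. There are another 100k files containing Facebook status updates, structured the same way.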

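【Comments】:

It looks like your question may be closed soon. Perhaps add a sample data set and the code you use to analyze it, then ask how that code can be improved. That may keep your question open.
I think you may want to create a TermDocumentMatrix with the tm package. The two are essentially the same thing, but tm is optimized for larger data sets (qdap was designed for smaller, more structured transcript data sets). If you really want a wfm, use as.wfm to coerce the TermDocumentMatrix. See the qdap-tm compatibility vignette for more information.
Thanks @MarkMiller. The issue is that I am designing a package I hope to submit to CRAN, so it is not specific to a single data set. I have updated the question to include the code for my function.
Thanks @TylerRinker. I'll give it a try.
You haven't provided data, which might have stopped the fifth and final person from voting to close. What I see is a lot of for loops. My guess is you could do these things in a more vectorized way. I'm also guessing you could read all of the data in at once and store it as a data.frame with a file identifier for each document, since that is what qdap is designed for. So you would have one column for text.var and one for grouping.var (the latter being filename[i]). You could do this with lapply, setNames, and qdapTools::list2df. But please provide 3 sample tweets to get help.

Tags: r performance text-analysis word-frequency qdap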

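【Solution 1】:

Here is a faster qdap approach and a mixed qdap/tm approach. I give the code first and then the timings for each. Basically, I read everything in at once and operate on the whole data set. You can then split it back apart with split if you like.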
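To show the idea without any packages, here is a minimal base-R sketch of "tokenize everything in one pass, build one table, split afterwards". The toy texts and all object names (texts, long, freq, per_doc) are made up for illustration, and the tokenizer is deliberately crude, so treat it as a sketch and not as what wfm does internally.

```r
# Hypothetical toy input: one character string per document.
texts <- c(tweet1 = "the cat sat on the mat",
           tweet2 = "the dog sat",
           tweet3 = "a cat and a dog")

# Tokenize every document in one pass (lower-case, split on non-letters).
tokens <- strsplit(tolower(texts), "[^a-z']+")

# Long format: one row per token, tagged with the document it came from.
long <- data.frame(
  word = unlist(tokens, use.names = FALSE),
  doc  = rep(names(texts), lengths(tokens)),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$word), ]  # drop empty tokens, if any

# A single cross-tabulation gives the whole word-by-document count matrix.
freq <- unclass(table(long$word, long$doc))

# If per-document pieces are needed afterwards, split the long table.
per_doc <- split(long$word, long$doc)
```

Here freq has one row per word and one column per document, so freq[, "tweet2"] holds the counts for that document alone.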

Here is the kind of MWE (minimal working example) you should provide with the question:

library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
    out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
    cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

qdap approach

tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in

the_wfm <- with(dat, wfm(text, tweet))

difftime(Sys.time(), tic)  ## time to make wfm

Timing the qdap approach

> tic <- Sys.time() ## time it
> 
> dat <- list2df(setNames(lapply(filename, function(x){
+     readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
> 
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
> 
> the_wfm <- with(dat, wfm(text, tweet))
> 
> difftime(Sys.time(), tic)  ## time to make wfm
Time difference of 48.9238 secs

Combined qdap-tm approach

tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in


tweet_corpus <- with(dat, as.Corpus(text, tweet))

tdm <- tm::TermDocumentMatrix(tweet_corpus,
    control = list(removePunctuation = TRUE,
    stopwords = FALSE))

difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix

Timing the combined qdap-tm approach

> tic <- Sys.time() ## time it
> 
> dat <- list2df(setNames(lapply(filename, function(x){
+     readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
> 
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
> 
> 
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
> 
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+     control = list(removePunctuation = TRUE,
+     stopwords = FALSE))
> 
> difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix
Time difference of 13.52377 secs

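There is a qdap-tm Package Compatibility vignette that helps users move between qdap and tm. As you can see, on 10,000 tweets the combined approach is about 3.5 times faster (roughly 13.5 seconds versus 48.9 seconds in total). A pure tm approach may be faster still. Also, if you want a wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.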
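For reference, a pure tm version might look roughly like the sketch below. The toy texts and the wordLengths setting are my own assumptions (by default tm drops words shorter than three characters), and I have not timed this against the runs above. The requireNamespace guard is only there so the snippet does nothing when tm is not installed.

```r
# Pure-tm sketch on made-up texts; runs only when tm is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  # Hypothetical toy documents standing in for the tweet files.
  texts <- c("the cat sat on the mat", "the dog sat")

  # Build the corpus directly from a character vector.
  corpus <- tm::VCorpus(tm::VectorSource(texts))

  # Same control options as in the combined approach, plus wordLengths
  # so that one- and two-letter words are kept (an assumption).
  tdm <- tm::TermDocumentMatrix(corpus,
      control = list(removePunctuation = TRUE,
                     stopwords = FALSE,
                     wordLengths = c(1, Inf)))

  # Dense copy for inspection only; keep the sparse tdm for large data.
  counts <- as.matrix(tdm)
}
```

For files on disk, tm::DirSource can stand in for tm::VectorSource so the corpus is read straight from a directory.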

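Your code is slow because it does not do things the R way. I suggest reading some additional material on R to get better at writing fast code. I am currently working through Hadley Wickham's Advanced R, which I recommend.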
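As a rough illustration of what "the R way" could mean for the category-counting loop in the question, here is a vectorized sketch. It assumes the dictionary has already been reshaped to long format (one row per word/category pair); the data and the names word_freq, dict_long, matched, and category_counts are hypothetical and do not come from the original function.

```r
# Hypothetical word frequencies, shaped like wfdf() output for one text.
word_freq <- data.frame(Words = c("happy", "sad", "run", "table"),
                        all   = c(3L, 1L, 2L, 5L),
                        stringsAsFactors = FALSE)

# Hypothetical dictionary in long format: one row per (word, category).
dict_long <- data.frame(word     = c("happy", "happy", "sad", "run"),
                        category = c("affect", "posemo", "affect", "motion"),
                        stringsAsFactors = FALSE)

# Join once instead of looking each word up inside a loop.
# Words without a dictionary entry ("table") drop out of the join.
matched <- merge(word_freq, dict_long, by.x = "Words", by.y = "word")

# Sum the frequencies per category in a single call.
category_counts <- tapply(matched$all, matched$category, sum)
```

The join and the grouped sum replace both nested loops, so the work scales with the size of the two tables and not with repeated row-by-row subsetting of the data frames.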

【Comments】:

Note that as of qdap version >= 2.1.1, wfm uses more of the tm package as a backend. This speeds up wfm considerably.