Removing stopwords from a user-defined corpus in R: answers

【Question title】: Removing stopwords from a user-defined corpus in R
【Posted】: 2016-05-30 13:11:09
【Question】:

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

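I want to remove the stopwords from this set of documents. I have already removed punctuation and converted everything to lower case, using: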

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But that last line produces the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

【Comments】:

Tags: r tm topic-modeling

【Solution 1】:

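When I run into problems with tm, I usually just edit the raw text directly.

Removing words this way is a bit awkward, but you can paste tm's stopword list together into a single regular expression.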

stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
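The result above still contains stray spaces where the stopwords used to be. As a minimal sketch of the same regex idea plus whitespace cleanup, the following uses base R only. The short `my_stopwords` vector and the `docs` and `cleaned` names are hypothetical stand-ins; in practice you would pass tm's `stopwords('en')` instead.

```r
# Hypothetical, shortened stopword list standing in for tm's stopwords('en')
my_stopwords <- c("she", "had", "for", "the", "this", "was")

# Plain character vector, already lower-cased and without punctuation
docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Build one alternation wrapped in word boundaries: \b(she|had|...)\b
stopwords_regex <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")

# Remove the stopwords
cleaned <- gsub(stopwords_regex, "", docs, perl = TRUE)

# Collapse runs of whitespace, then trim leading and trailing spaces
cleaned <- gsub("\\s+", " ", cleaned)
cleaned <- gsub("^\\s+|\\s+$", "", cleaned)

print(cleaned)
```

Note that this operates on a character vector, not on a tm Corpus object, so it has to run before the call to Corpus(VectorSource(...)).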

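【Discussion】:

Thank you very much for your reply. I get the error "String must be an atomic vector" on the stringr::str_replace_all line. Any idea how to fix this?
Aha! Just answered my own question: add the line documents1 = paste(c(documents)) before the stopwords_regex part. Thanks again!
First of all, thanks for the great answer. It helps to reverse the stopword list before pasting it together, like stopwords_regex = paste(rev(stopwords('en')), collapse = '\\b|\\b')

【Solution 2】: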

Maybe try using the tm_map function for every transformation of the documents. It seems to work for me.

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "

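【Discussion】:

Thanks Elyasin, but I am already using the tm package, and it is tm_map(documents, removeWords, stopwords("english")) that raises the error.
I know. But look at my answer more closely. I got a sensible result; the command I ran before removing punctuation and stopwords was documents = tm_map(documents, content_transformer(tolower)). Give it a try.
I looked again, and it seems I cannot rely on tm_map at all. Sometimes it does not error and I can remove the stopwords with your approach, but sometimes it raises the same error ("The process has forked..."). I have never run into an intermittent error like this before. Any ideas?
Which version of R are you using? On which operating system?
Try options(mc.cores=1) in your .Rprofile file or at the top of your R script. As far as I remember, this was the workaround for avoiding strange error messages in a course where the participants used tm.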
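The workaround from the last comment can be sketched as follows. This assumes the error comes from tm_map starting forked worker processes, which the comment suggests but the thread does not confirm. The `corpus` name and the single sample document are hypothetical, and the tm part is skipped if the package is not installed.

```r
# Limit parallel workers to one; set this before any tm_map() call
options(mc.cores = 1)

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # One sample document, already lower-cased and without punctuation
  corpus <- VCorpus(VectorSource("the talks on the first day were great"))

  # Same stopword removal step as in the question
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  print(content(corpus[[1]]))
}
```

Putting the options() line in .Rprofile applies it to every session, while putting it at the top of a script keeps the setting local to that script.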

【Solution 3】:

You can remove stopwords with the quanteda package, but first make sure your words have been converted to tokens, then use the following:

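library(quanteda)
x <- tokens(documents) # tokenize first; documents must be a character vector here, not a tm Corpus
x <- tokens_select(x, stopwords("en"), selection = "remove") # drop the English stopwords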

【Discussion】: