Removing Stop Phrases from a DocumentTermMatrix

【Question title】: Removing Stop Phrases from DocumentTermMatrix
【Posted】: 2018-07-14 06:07:22
【Question】:

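Below, I run basic topic modeling on the "crude" data. I know I can remove stopwords with tm_map, but I can't figure out how to do so after bigram tokenization has taken place.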

library(topicmodels)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(tidytext)

data("crude")
words <- tm_map(crude, content_transformer(tolower))
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

#bigram tokenization
dtm <- DocumentTermMatrix(words,control = list(tokenize = BigramTokenizer))
ui = unique(dtm$i) 
dtm = dtm[ui,] #remove "empty" tweets

lda <- LDA(dtm, k = 2,control = list(seed = 7272))

topics <- tidy(lda, matrix = "beta")

##Graphs
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

#single
stopwords1<- stopwords("english") ##I actually use a custom list: read.csv("stopwords.txt", header = FALSE)
adnlstopwords1<-c("ny","new","york","yorks","state","nyc","nys")

#doubles
stopwords2<-levels(interaction(stopwords1,stopwords1,sep=' '))
adnlstopwords2<-c(stopwords2,c("new york", "york state", "in ny", "in new",
                  "new yorks"))

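# combine all lists; start from stopwords1, because a bare `stopwords` here would be the tm function, not a vector
stopwords <- unique(c(stopwords1, adnlstopwords1, stopwords2, adnlstopwords2))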

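My question is how to remove these bigrams from the dtm without using tm_map, or what a possible workaround would be. Note that the "new york"-based bigrams may not appear in the crude data, but they matter for my other data.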

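【Question comments】:

To be clearer: I want to do this after the bigrams are built, because I want to keep bigrams like "I care" but eliminate bigrams like "I don't care". That is why removing single words alone cannot give the desired output.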
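The distinction in this comment can be sketched in base R. The vocabulary and stop list below are made up for illustration (they stand in for the terms of a 1- to 2-gram dtm), and the point is only that exact matching on whole terms removes a phrase without touching other terms that share a word with it:

```r
# Hypothetical vocabulary, standing in for the terms of a dtm built
# with 1- to 2-gram tokenization (punctuation already removed).
terms <- c("i", "care", "i care", "dont", "dont care", "oil", "oil prices")

# Hypothetical stop list mixing single words and a stop phrase.
stop_phrases <- c("i", "dont", "dont care")

# Exact matching on whole terms: "i care" survives even though "i" is
# on the list, while "dont care" is dropped.
kept_terms <- terms[!(terms %in% stop_phrases)]
print(kept_terms)
```

Removing "i" from the text before tokenization would instead have destroyed the bigram "i care" as well, which is the outcome the comment wants to avoid.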

Tags: r n-gram topic-modeling corpus stop-words

【Solution 1】:

I found this solution in the "gofastr" package for R:

dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

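However, I still saw stopwords in the result. According to the documentation, remove_stopwords assumes it is given a sorted list; you can prepare the stopwords/phrases with the prep_stopwords() function from the same package.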

stopwords<-prep_stopwords(stopwords)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
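For comparison, the removal step can also be sketched with tm alone, by dropping the columns whose term is on the list. This is not how gofastr is implemented; it is a minimal illustration on a made-up two-document corpus. The names `docs`, `OneTwoGramTokenizer`, `stop_terms` and `dtm_clean` are hypothetical, and the tokenizer uses `ngrams()` and `words()` from the NLP package (attached with tm) instead of RWeka.

```r
library(tm)

# Made-up corpus; VCorpus is used because custom tokenizers need it.
docs <- VCorpus(VectorSource(c("oil prices rose in new york",
                               "opec cut oil output")))

# Unigrams plus bigrams, without RWeka.
OneTwoGramTokenizer <- function(x) {
  w <- words(x)
  bigrams <- vapply(ngrams(w, 2L), paste, character(1), collapse = " ")
  c(w, bigrams)
}

dtm_small <- DocumentTermMatrix(
  docs,
  control = list(tokenize = OneTwoGramTokenizer, wordLengths = c(1, Inf))
)

# Hypothetical stop list with single words and phrases.
stop_terms <- c("in", "new", "york", "new york", "in new")

# Keep only the columns whose term is not on the stop list.
dtm_clean <- dtm_small[, !(Terms(dtm_small) %in% stop_terms)]

# Drop documents left without any terms, as in the question (LDA needs this).
dtm_clean <- dtm_clean[unique(dtm_clean$i), ]

print(Terms(dtm_clean))
```

Only exact term matches are removed, so a bigram such as "rose in" stays in the matrix unless it is added to the list explicitly.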

To do this together with stemming, we can stem the documents in the tm_map part of the code and then remove the stopwords as follows:

stopwords<-prep_stopwords(stemDocument(stopwords))
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

This works because the stopwords are stemmed too, so they match the terms in the dtm, which have already been stemmed.
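Why the stop list has to be stemmed as well can be seen in a small sketch. It assumes the SnowballC package is installed and stems each word of a phrase separately; `stem_phrase` and the example vectors are hypothetical helpers for illustration, not part of tm or gofastr.

```r
library(SnowballC)

# Stem every word of a (possibly multi-word) stop phrase separately.
stem_phrase <- function(phrase) {
  parts <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  paste(wordStem(parts, language = "english"), collapse = " ")
}

# Hypothetical unstemmed stop list.
stop_phrases <- c("prices", "oil prices")

# Hypothetical terms from a dtm built on stemmed documents.
stemmed_terms <- c("oil", "price", "oil price", "rose")

# Unstemmed stop phrases match nothing in the stemmed vocabulary.
print(stemmed_terms[stemmed_terms %in% stop_phrases])

# After stemming the stop list, both entries match and can be removed.
stemmed_stops <- unique(vapply(stop_phrases, stem_phrase, character(1),
                               USE.NAMES = FALSE))
kept_stemmed <- stemmed_terms[!(stemmed_terms %in% stemmed_stops)]
print(kept_stemmed)
```

Whether stemDocument() treats multi-word strings word by word may depend on the tm version, so checking a few stemmed phrases against the terms of the dtm is a sensible precaution.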

【Comments】: