Removing German stop words in R: answers

【Question title】: Removing German stop words in R
【Posted】: 2018-08-21 14:32:01
【Question】:

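I have survey data with a comments column, and I want to run sentiment analysis on the responses. The problem is that the data contains several languages, and I don't know how to remove stop words for more than one language from the set.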

'nps' is my data source, and nps$customer.feedback is the comments column.

First I tokenise the data:

# TOKENISE
comments <- nps %>%
  filter(!is.na(customer.feedback)) %>%
  select(cat, Comment) %>%
  group_by(row_number(), cat)

comments <- comments %>% ungroup()
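The snippet above never shows where `nps_words` (used in the next step) comes from. A minimal sketch of the missing tokenisation step, assuming it was done with `tidytext::unnest_tokens()` on the `Comment` column; the toy `comments` table and its values are made up for illustration and are not from the original data:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the `comments` table (hypothetical values)
comments <- tibble::tibble(
  cat     = c("A", "B"),
  Comment = c("Die Lieferung war schnell", "The batteries die quickly")
)

# One row per word; unnest_tokens() lower-cases tokens by default
# and keeps the remaining columns (here `cat`)
nps_words <- comments %>%
  unnest_tokens(word, Comment)
```

The result has a `word` column, which is what the `anti_join()` and `inner_join()` calls in the rest of the question join on.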

Then I remove the stop words:

nps_words <-  nps_words %>% anti_join(stop_words, by = c('word'))

Then I use stemming and get_sentiments("bing") to show word counts by sentiment.

# stemgraph
nps_words %>%
  mutate(word = wordStem(word)) %>%
  inner_join(get_sentiments("bing") %>% mutate(word = wordStem(word)),
             by = c('word')) %>%
  count(cat, word, sentiment) %>%
  group_by(cat, sentiment) %>%
  top_n(7) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~cat, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Word counts by Sentiment by Category - Bing (Stemmed)",
       x = "Words", y = "Count")

However, because German text is being analysed, "di" and "die" show up in the "negative" plot.

Can anyone help?

My goal is to remove German stop words using the code above.

【Comments】:

A quick Google search turns up this, which has a list of stop words. Would it be worth first splitting your comments by detected language, like this?

Tags: r text text-mining text-analysis

【Solution 1】:

To answer your question, you can remove German stop words as follows, using the stopwords package:

your code
.....  
stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)

nps_words <-  nps_words %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word"))

...
rest of code

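However, be aware that tidytext is geared mainly towards English rather than other languages. Stemming and sentiment analysis on German text will give you wrong results. The Bing sentiment lexicon covers English words only. An inner_join like yours drops most German words, because they have no match in the English lexicon. Some do match, though, such as "die" (which you would remove if you use German stop words, and which means "who" or "that" in German). But if you remove that word, you may also inadvertently remove the English "die" (as in death).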
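One way to avoid that collision is to remove stop words per language instead of across the whole data set. The sketch below is not part of the original answer. It assumes each token already carries a language code in a `lang` column (a hypothetical column; how the language is detected is not shown here), and the toy tokens are made up for illustration:

```r
library(dplyr)

# Toy tokens; `lang` is a hypothetical per-comment language tag
nps_words <- tibble::tibble(
  lang = c("de", "de", "de", "en", "en", "en"),
  word = c("die", "lieferung", "schnell", "the", "batteries", "die")
)

# One stop-word table covering both languages, keyed by the same codes
stop_by_lang <- bind_rows(
  tibble::tibble(lang = "de", word = stopwords::stopwords("de")),
  tibble::tibble(lang = "en", word = stopwords::stopwords("en"))
)

# Joining on both columns removes a word only within its own language,
# so German "die" is dropped while English "die" is kept
nps_clean <- nps_words %>%
  anti_join(stop_by_lang, by = c("lang", "word"))
```

With this split, only the English rows would then be passed on to the Bing lexicon, since that lexicon has no German entries.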

This SO post has more information on German sentiment analysis.

【Comments】: