Word frequency not accurate after text stemming

【Question title】: Word frequency not accurate after text stemming
【Posted】: 2017-02-21 19:10:12
【Question】:

Thank you for taking the time to read my post. I'm a newbie here, and this is my first R script, with some sample data.

library(tm)
library(hunspell)
library(stringr)

docs <- VCorpus(VectorSource('He is a nice player, She could be a better player. Playing basketball is fun. Well played! We could have played better. Wish we had better players!'))

input <- strsplit(as.character(docs), " ")
input <- unlist(input)
input <- hunspell_stem(input)
input <- word(input,-1)

input <- VCorpus(VectorSource(input))
docs <- input

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)

This returns the following result:

character0 48 better 3 play 3 basketball 1 description 1
fun 1 head 1 hour 1 language 1 meta 1 min 1 nice 1 origin 1 well 1 wish 1 year 1

Expected result:

better 3 play 3 basketball 1 fun 1 language 1 nice 1 well 1 wish 1

I'm not sure where some of these words come from (character0, description, meta, language, etc.). Is there a way to get rid of them?

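Basically, what I'm trying to do is apply stemming with hunspell to a corpus (the data source is a SQL Server table) and then display the stems in a word cloud. Any help would be appreciated. GD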

【Comments on the question】:

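If you go through your code line by line, you will see that `input <- word(input,-1)` breaks things. `hunspell_stem` returns a list structure in which each element corresponds to one word; however, it may return more than one stem for a single word. You might want to skip the middle part and do `dtm <- TermDocumentMatrix(docs, control = list(stemming = function(x) sapply(hunspell_stem(x), tail, 1)))`.
Thanks for your help. It works for the example above, but if I enter something else I get an error: `library(tm); library(hunspell); docs <- VCorpus(VectorSource('Thanks lukeA for your help!')); dtm <- TermDocumentMatrix(docs, control = list(stemming = function(x) sapply(hunspell_stem(x), tail, 1))); m <- as.matrix(dtm); sort(rowSums(m),decreasing=TRUE)` gives `Error in table(txt) : all arguments must have the same length`. Is that because there is no stem in the string?
That is because no stem is returned for lukea. So if you stick with hunspell, you have to check not only for multiple stems but also for no stems.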
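
As a side note on the `character0` entry in the question's output: it is consistent with a list of stems being coerced to a character vector, which turns an empty element into the literal string "character(0)"; punctuation removal would then strip the parentheses. The sketch below uses base R only, with a hand-written list that mimics the shape `hunspell_stem` returns (the values are taken from outputs quoted in this thread). That `word()` coerces its input this way is an assumption, not something verified here.

```r
# Hand-written stand-in for hunspell_stem() output: one character vector
# per token, possibly empty, possibly holding several stems.
stems <- list("thank", character(0), c("test", "t"))

# Coercing the list to character deparses each element.
flat <- as.character(stems)
print(flat)

# Stripping punctuation afterwards leaves "character0" for the empty element.
cleaned <- gsub("[[:punct:]]+", "", flat)
print(cleaned)
```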

Tags: r hunspell

【Solution 1】:

Here is why the example in your comment fails:

library(tm) 
library(hunspell) 
hunspell_stem(strsplit('Thanks lukeA for your help!', "\\W")[[1]])
# [[1]]
# [1] "thank"
# 
# [[2]]
# character(0)
# 
# [[3]]
# [1] "for"
# 
# [[4]]
# [1] "your"
# 
# [[5]]
# [1] "help"
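
To see why the empty second element matters downstream, here is a minimal base-R sketch. It assumes the stemming result ends up in a `table()` call, as the `Error in table(txt)` message in the comment suggests; the list is a hand-written copy of the output above rather than a live `hunspell_stem` call.

```r
# Hand-written copy of the stems shown above ("lukeA" has none).
stems <- list("thank", character(0), "for", "your", "help")

# Lengths differ (1, 0, 1, 1, 1), so sapply() cannot simplify to a vector
# and hands back a list instead.
last <- sapply(stems, tail, 1)
print(is.list(last))
print(lengths(last))

# table() treats a single list argument as several factors of equal length,
# so the ragged list triggers an error.
res <- tryCatch(table(last), error = function(e) "error")
print(res)
```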

Here is one way to make it work:

docs <- VCorpus(VectorSource('Thanks lukeA for your help!')) 
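myStem <- function(x) { 
  res <- hunspell_stem(x)          # list: one character vector of stems per token
  idx <- which(lengths(res)==0)    # positions of tokens with no stem
  if (length(idx)>0)
    res[idx] <- x[idx]             # fall back to the original token
  sapply(res, tail, 1)             # keep the last stem of each token
}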
dtm <- TermDocumentMatrix(docs, control = list(stemming = myStem)) 
m <- as.matrix(dtm) 
sort(rowSums(m),decreasing=TRUE)
  # for help! lukea thank  your 
  #   1     1     1     1     1

This returns the original token if there is no stem, and the last stem if there are several.
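
The same rule can be tried out without a dictionary installed. The sketch below is a base-R variant, not the code above: `pick_stem` is a hypothetical helper that takes the stems as an argument, and the `stems` list is hand-written from outputs quoted in this thread. It uses `vapply()` so that the result is always a character vector with one entry per token.

```r
# Hypothetical helper: `stems` has the shape hunspell_stem() returns
# (one character vector per token, possibly empty).
pick_stem <- function(tokens, stems) {
  stopifnot(length(tokens) == length(stems))
  empty <- lengths(stems) == 0
  if (any(empty)) {
    stems[empty] <- as.list(tokens[empty])  # no stem: keep the token itself
  }
  # Several stems: keep the last one. vapply() enforces one string each.
  vapply(stems, function(s) s[length(s)], character(1))
}

tokens <- c("thanks", "lukea", "test")
stems  <- list("thank", character(0), c("test", "t"))
result <- pick_stem(tokens, stems)
print(result)
```

Swapping `s[length(s)]` for `s[1]` would keep the first stem instead of the last.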

【Comments】:

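Thanks again. I don't really understand why it fails when I replace the test string above with "test". Hunspell returns "t" as a stem, so I would expect "t" to be returned. If I use "testing" as the string it works fine, because the stem returned is "test". Any ideas? Many thanks.
The function returns the last (`tail`) of the two stems "test" and "t". You can use `head` to return the first one instead. However, "t" is removed by the default word-length filter, and that is why it fails. You can do `TermDocumentMatrix(docs, control = list(stemming = myStem, wordLengths = c(-Inf, Inf)))` to disable filtering by word length. See `?termFreq`.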
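
A small base-R sketch of the two points in that reply: the difference between `head` and `tail` on the stems reported for "test", and the effect of a length bound. `keep_by_length` is a hypothetical stand-in for the filter, not tm's implementation; the bound `c(3, Inf)` is the documented default of `wordLengths` in `termFreq`.

```r
stems <- c("test", "t")        # stems reported in this thread for "test"
first_stem <- head(stems, 1)   # what head would pick
last_stem  <- tail(stems, 1)   # what myStem picks

# Hypothetical stand-in for a word-length filter with bounds c(min, max).
keep_by_length <- function(terms, bounds) {
  n <- nchar(terms)
  terms[n >= bounds[1] & n <= bounds[2]]
}

dropped  <- keep_by_length(last_stem, c(3, Inf))      # too short, removed
kept_all <- keep_by_length(last_stem, c(-Inf, Inf))   # no bound, survives
kept_1st <- keep_by_length(first_stem, c(3, Inf))     # long enough, survives
print(list(dropped, kept_all, kept_1st))
```

So either choosing the first stem or relaxing `wordLengths` leaves a term in the matrix for this input.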