【Posted】: 2022-01-21 12:51:24
【Problem description】:
Here is a snippet of my code:
library(gutenbergr)
library(tm)
Alice <- gutenberg_download(c(11))
Alice <- Corpus(VectorSource(Alice))
cleanAlice <- tm_map(Alice, removeWords, stopwords('english'))
cleanAlice <- tm_map(cleanAlice, removeWords, c('Alice'))
cleanAlice <- tm_map(cleanAlice, tolower)
cleanAlice <- tm_map(cleanAlice, removePunctuation)
cleanAlice <- tm_map(cleanAlice, stripWhitespace)
dtm1 <- TermDocumentMatrix(cleanAlice)
dtm1
Then I get the following output and error:
<<TermDocumentMatrix (terms: 3271, documents: 2)>>
Non-/sparse entries: 3271/3271
Sparsity : 50%
Error in nchar(Terms(x), type = "chars") :
invalid multibyte string, element 12
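How should I handle this? Should I convert the corpus to plain text documents first? Is there something wrong with the text format of the book?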
【Discussion】:
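Two things in the post look worth checking. This is a rough sketch, not a confirmed diagnosis. First, `documents: 2` suggests the whole data frame returned by `gutenberg_download` was passed to `VectorSource`, so each column probably became one document; passing only the text column avoids that. Second, the `invalid multibyte string` message usually points to bytes that are not valid in the current encoding. The sketch below assumes the text is meant to be UTF-8 and drops any invalid bytes with `iconv` before building the corpus. The helper name `build_tdm` is hypothetical, and the cleaning steps are reordered so that lowercasing happens before stop-word removal.

```r
library(tm)

# Hypothetical helper (not from the original post): builds a
# term-document matrix from a character vector of text lines.
build_tdm <- function(lines) {
  # Assumption: the input is meant to be UTF-8. sub = "" drops bytes that
  # are not valid UTF-8 instead of letting them reach nchar() later.
  lines <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")

  # One document per element of the character vector.
  corpus <- VCorpus(VectorSource(lines))

  # Lowercase first so that stop words and "alice" match regardless of case.
  # content_transformer() keeps the documents as tm text documents.
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, c("alice"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)

  TermDocumentMatrix(corpus)
}
```

With the book from the post, this could be called as `build_tdm(gutenberg_download(11)$text)`, assuming the download returns a data frame with a `text` column. Each line of the book then becomes its own document. If a single document is wanted, collapsing the lines first with `paste(lines, collapse = " ")` is one option.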
Tags: r matrix text-mining