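【Posted】: 2014-08-19 00:02:42
【Question】:
Hi all, I've run into a problem while running LDA: for some reason, as soon as I'm ready to do the analysis, I get errors. I'll do my best to walk through what I'm doing. Unfortunately I can't provide the data, because the data I'm using is proprietary.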
dataset <- read.csv("proprietarydata.csv")
First I do a little cleaning. `dataset$text` and `post` are both of class character.
dataset$text <- as.character(dataset$text)
post <- gsub("[^[:print:]]"," ",dataset$Post.Content)
post <- gsub("[^[:alnum:]]", " ",post)
The posts end up looking like this:
[1] "here is a string"
[2] "here is another string"
etc....
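To make the effect of the two `gsub` calls concrete, here is a minimal sketch on made-up strings (the real `Post.Content` column is not available, so the input below is purely illustrative). It also suggests how a post could end up with no words at all after cleaning.

```r
# Made-up input; these strings are assumptions, not the original data.
raw <- c("Hello,\tworld! Visit http://example.com", "price: $5", "???")

# Step 1: replace non-printable characters (such as the tab) with a space.
step1 <- gsub("[^[:print:]]", " ", raw)

# Step 2: replace everything that is not a letter or digit with a space.
step2 <- gsub("[^[:alnum:]]", " ", step1)

print(step2)

# A post made only of punctuation is now blank.
print(trimws(step2[3]) == "")
```

Each character is replaced one-for-one, so string lengths are unchanged; only the content is blanked out.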
Then I created the following function to do some more cleaning:
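library(tm)  # provides Corpus, VectorSource, tm_map, removeWords, stopwords

createdtm <- function(x){
  # Build a corpus with one document per post
  myCorpus <- Corpus(VectorSource(x))
  myCorpus <- tm_map(myCorpus, PlainTextDocument)
  # Lower-case, then remove SMART stopwords plus some custom words
  docs <- tm_map(myCorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind="SMART"))
  docs <- tm_map(docs, removeWords, c("the"," the","will","can","regards","need","thanks","please","http"))
  docs <- tm_map(docs, stripWhitespace)
  # Convert back to PlainTextDocument after the transformations
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}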
predtm <- createdtm(post)
This ends up returning a corpus that gives something like this for each document:
[[1]]
<<PlainTextDocument (metadata: 7)>>
Here text string
[[2]]
<<PlainTextDocument (metadata: 7)>>
Here another string
Then I get ready for LDA by creating a DocumentTermMatrix
dtm <- DocumentTermMatrix(predtm)
inspect(dtm)
<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity : 100%
Maximal term length: 86
Weighting : term frequency (tf)
Docs truclientrre truddy trudi trudy true truebegin truecontrol
Terms
Docs truecrypt truecryptas trueimage truely truethis trulibraryref
Terms
Docs trumored truncate truncated truncatememory truncates
Terms
Docs truncatetableinautonomoustrx truncating trunk trunkhyper
Terms
Docs trunking trunkread trunks trunkswitch truss trust trustashtml
Terms
Docs trusted trustedbat trustedclient trustedclients
Terms
Docs trustedclientsjks trustedclientspwd trustedpublisher
Terms
Docs trustedreviews trustedsignon trusting trustiv trustlearn
Terms
Docs trustmanager trustpoint trusts truststorefile truststorepass
Terms
Docs trusty truth truthfully truths tryd tryed tryig tryin tryng
This looks really odd to me, but it's the way I've always done it. So I move on and run the following
library(topicmodels)  # provides LDA
run.lda <- LDA(dtm,4)
This returns my first error
Error in LDA(dtm, 4) :
Each row of the input matrix needs to contain at least one non-zero entry
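The message says that at least one document has no terms left, which can happen when every word in a post is removed during cleaning. A minimal sketch of how to spot such rows, using a small made-up count matrix (the values and term names are assumptions for illustration, not the original data):

```r
# Toy document-term counts; rows are documents, columns are terms.
counts <- matrix(c(2, 0, 1,
                   0, 0, 0,   # a document whose words were all removed
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2", "3"),
                                 Terms = c("string", "trust", "trunk")))

# Row totals of zero identify the empty documents.
row_totals <- rowSums(counts)
empty_docs <- names(row_totals)[row_totals == 0]
print(empty_docs)
```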
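After researching this error, I found this post: Remove empty documents from DocumentTermMatrix in R topicmodels? I assumed I had everything under control and got excited, so I followed the steps in the link, but then
This runs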
rowTotals <- apply(dtm , 1, sum)
This does not
dtm.new <- dtm[rowTotals> 0]
It returns:
Error in `[.simple_triplet_matrix`(dtm, rowTotals > 0) :
Logical vector subscripting disabled for this object.
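One possible reading of this error, offered as a hedged guess rather than a confirmed diagnosis: `dtm[rowTotals> 0]` has no comma, so it is treated as a single vector index instead of a row selection. The sketch below uses `simple_triplet_matrix` from the slam package (the class named in the error message) with made-up data; the two-index form `dtm[rowTotals > 0, ]` would be the analogous call on the real matrix.

```r
library(slam)  # the sparse matrix class named in the error message

# Toy sparse matrix (made up): document 2 has no non-zero entries.
m <- simple_triplet_matrix(i = c(1L, 1L, 3L), j = c(1L, 3L, 2L), v = c(2, 1, 3),
                           nrow = 3, ncol = 3,
                           dimnames = list(Docs = c("1", "2", "3"),
                                           Terms = c("string", "trust", "trunk")))

# Sparse-aware row totals, an alternative to apply(m, 1, sum).
row_totals <- row_sums(m)

# Note the comma: select rows by the logical vector, keep all columns.
m_new <- m[row_totals > 0, ]
print(dim(m_new))
```

Using `row_sums` avoids converting a large, very sparse matrix to a dense one, which may matter at 14640 x 39972.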
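I know I might get some heat for this, since some of you may say it isn't a reproducible example. Please feel free to ask anything about this problem. This is the best I can do.
【Comments】: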