根据https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation 在LDA 中,每个文档都被视为各种主题的混合体。也就是说,对于每个文档(推文),我们得到推文属于每个主题的概率。概率总和为 1。
同样,每个主题都被视为各种术语(单词)的混合体。也就是说,对于每个主题,我们得到每个单词属于该主题的概率。概率总和为 1。
因此,对于每个单词主题组合,都分配了一个概率。代码terms(om1) 获取每个主题中概率最高的单词。
因此,在您的情况下,您会发现同一个词在多个主题中的概率最高。这不是错误。
以下代码将创建 TopicTermdf 数据集,其中包含每个主题的所有单词的分布。查看数据集,将有助于您更好地理解。
以下代码基于以下LDA with topicmodels, how can I see which topics different documents belong to? 帖子。
代码:
# Reproducible data - From Coursera.org John Hopkins Data Science Specialization Capstone project, SwiftKey Challange dataset
tweets <- c("How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.",
"When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.",
"they've decided its more fun if I don't.",
"So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)",
"Words from a complete stranger! Made my birthday even better :)",
"First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!",
"i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing",
"I'm coo... Jus at work hella tired r u ever in cali",
"The new sundrop commercial ...hehe love at first sight",
"we need to reconnect THIS WEEK")
library(tm)
myCorpus <- Corpus(VectorSource(tweets))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
library('SnowballC')
myCorpus <- tm_map(myCorpus, stemDocument)
dtm<-DocumentTermMatrix(myCorpus)
library(RTextTools)
library(topicmodels)
om1<-LDA(dtm,3)
输出:
> # Get the top word for each topic
> terms(om1)
Topic 1 Topic 2 Topic 3
"youll" "cub" "anoth"
>
> #Top word for each topic
> colnames(TopicTermdf)[apply(TopicTermdf,1,which.max)]
[1] "youll" "cub" "anoth"
>