gensim LDA module: Always getting uniform topical distribution while predicting

【Question title】: gensim LDA module: Always getting uniform topical distribution while predicting
【Posted】: 2017-03-14 09:20:26
【Question】:

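I have a set of documents and I want to know the topic distribution of each document (for different values of the number of topics). I took a toy program from this question. I first trained the LDA model provided by gensim, and then fed the training data back in as test data to get the topic distribution of each training document. But I always get a uniform topic distribution.

This is the toy code I used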

import gensim
import logging
logging.basicConfig(filename="logfile",format='%(message)s', level=logging.INFO)


def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return topic_dist

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gensim.corpora.Dictionary(texts)
id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word
mm = [dictionary.doc2bow(text) for text in texts]
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2, update_every=1, chunksize=10000, passes=1,minimum_probability=0.0)

newdocs=["human system"]
print(lda[dictionary.doc2bow(newdocs)])

newdocs=["Human machine interface for lab abc computer applications"]  # same as 1st doc in training
print(lda[dictionary.doc2bow(newdocs)])

This is the output:

[(0, 0.5), (1, 0.5)]
[(0, 0.5), (1, 0.5)]

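I checked more examples, but they all ended up giving the same equal-probability result.

This is the log file that was generated (i.e. the output of the logger)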

adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions)
using symmetric alpha at 0.5
using symmetric eta at 0.5
using serial LDA version on this node
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words
PROGRESS: pass 0, at document #9/9
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps"
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer"
topic diff=0.539396, rho=1.000000

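It says "too few updates, training might not converge", so I tried increasing the number of passes to 1000, but the output is still the same. (Although it is unrelated to convergence, I also tried increasing the number of topics.)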

【Question comments】:

Tags: python lda gensim

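【Solution 1】:

The problem is in how the variable newdocs is converted into a gensim document. dictionary.doc2bow() does expect a list, but a list of words. You passed a list of documents, so it treats "human system" as a single word; since no such word exists in the training set, it is ignored. To make my point clearer, see the output of the following code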

import gensim
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gensim.corpora.Dictionary(texts)

print(dictionary.doc2bow("human system".split()))  # list of two known words
print(dictionary.doc2bow(["human system"]))        # one unknown "word"
print(dictionary.doc2bow(["human"]))               # one known word
print(dictionary.doc2bow(["foo"]))                 # one unknown word
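For readers without gensim at hand, the effect can be reproduced with a small stand-in. This is only a sketch: toy_doc2bow and the hand-made token2id mapping below are hypothetical names made up for illustration, not gensim's implementation, and the ids are arbitrary. It assumes only the behaviour described above, namely that tokens missing from the vocabulary are skipped.

```python
# Hypothetical stand-in for Dictionary.doc2bow, for illustration only;
# it is NOT gensim's implementation.
from collections import Counter


def toy_doc2bow(token2id, tokens):
    """Count known tokens; skip tokens that are missing from the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Tiny hand-made vocabulary (arbitrary ids, not the ids gensim would assign).
token2id = {"human": 0, "system": 1, "interface": 2}

split_doc = toy_doc2bow(token2id, "human system".split())
unsplit_doc = toy_doc2bow(token2id, ["human system"])
unknown_doc = toy_doc2bow(token2id, ["foo"])

print(split_doc)    # one (id, count) pair per known word
print(unsplit_doc)  # empty: "human system" is not a single vocabulary entry
print(unknown_doc)  # empty: "foo" is not in the vocabulary
```

The unsplit string and the unknown word both end up as an empty bag-of-words, which is what the model received in the question.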

So to correct the code in the question, all you have to do is change newdocs as follows

newdocs = "human system".lower().split()
newdocs = "Human machine interface for lab abc computer applications".lower().split()

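Oh, and by the way, the behaviour you observed, getting identical probabilities, is simply the topic distribution of an empty document, i.e. the uniform distribution.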
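A rough sketch of why an empty document comes out uniform, assuming the usual variational update in which each topic's gamma is the prior alpha plus the expected number of the document's words assigned to that topic. The helper normalized_gamma and the counts for the non-empty case are made up for illustration; only the alpha value is taken from the log line "using symmetric alpha at 0.5".

```python
# Sketch under an assumption: gamma_k = alpha_k + expected word count for topic k.
# normalized_gamma is a hypothetical helper, not a gensim function.
def normalized_gamma(alpha, expected_counts):
    gamma = [a + c for a, c in zip(alpha, expected_counts)]
    total = sum(gamma)
    return [g / total for g in gamma]


alpha = [0.5, 0.5]  # symmetric prior, as reported in the question's log

# An empty bag-of-words contributes no counts to any topic,
# so only the symmetric prior is left to normalize.
empty_doc = normalized_gamma(alpha, [0.0, 0.0])

# Hypothetical counts for a non-empty document, chosen only for illustration.
two_word_doc = normalized_gamma(alpha, [1.5, 0.5])

print(empty_doc)
print(two_word_doc)
```

With a symmetric alpha and no words, every topic gets the same share, which matches the [(0, 0.5), (1, 0.5)] output in the question.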

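【Discussion】:

Perfect! Thanks! There is one more thing I need to know. As mentioned in the question, my main goal in all of this is to get the topic distribution of the documents. After running LDA, is there a better way to get it than the small hack I used in the code (feeding the training set back in as the test set)?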