Document topical distribution in Gensim LDA: answers

【Question title】: Document topical distribution in Gensim LDA
【Posted】: 2013-06-23 01:16:06
【Question】:

I derived an LDA topic model from a toy corpus as follows:

from gensim import corpora
from gensim.models import LdaModel

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

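# Lowercase and whitespace-tokenize each document
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

# Assumed: the original post does not show how `corpus` was built
corpus = [dictionary.doc2bow(text) for text in texts]

# Invert token2id to get an id -> word mapping
id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word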

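I found that when I derive the model with a small number of topics, Gensim returns a complete report of the topical distribution, covering every potential topic, for a test document. For example: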

test_lda = LdaModel(corpus, num_topics=5, id2word=id2word)
# Note: doc2bow expects a list of tokens; recent gensim raises TypeError on a bare string
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

But when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

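It seems to me that topics with a probability below some threshold (0.01, to be specific, from what I observed) are omitted from the output.

Is this behaviour due to some aesthetic consideration? And how can I get the distribution of the residual probability mass over all the other topics?

Thank you for your kind answer!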

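【Comments】:

Perhaps the percentages of the other topics are too low to be considered prominent.
I ran into the same problem. Did you find a solution?
Can you say how you created "corpus"?

Tags: python lda gensim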

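【Solution 1】:

I realise this is an old question, but in case someone stumbles upon it, here is a solution (the issue is actually fixed in the current development branch with a minimum_probability parameter to LdaModel, but maybe you're running an older version of gensim).

Define a new function (this is just copied from the source):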

def get_doc_topics(lda, bow):
    # inference() takes a chunk of documents; gamma has one row per document
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]

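The function above does not filter the output topics by probability; it returns all of them. If you don't need the (topic_id, value) tuples but only the values, just return topic_dist instead of the list comprehension (it will also be much faster).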

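【Comments】:

Hi, is gamma the probability distribution over topics? Sorry if this sounds silly, I'm not very familiar with the internals of LDA. I ask because the documentation says: "Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk." I also thought gensim provides a generator (lda[corpus]).
gamma holds the unnormalized topic scores for each document; topic_dist is the probability distribution. Yes, gensim provides the generator lda[corpus], which uses lda.inference internally. As I said above, if you don't need the (topic_id, probability) pairs, calling .inference yourself will be faster. If your corpus is very large and doesn't fit in memory, you may need to process it in chunks; lda[corpus] also chunks internally.
NB: to normalize the topic distribution of every document in the chunk, not just the first one, use topic_dist = gamma / gamma.sum(axis=1)[:, None]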
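To make the row-wise normalization concrete, here is a small sketch using NumPy only. The gamma array is made up for illustration (3 documents, 4 topics) and merely stands in for what lda.inference would return for a chunk.

```python
import numpy as np

# Hypothetical unnormalized topic scores: one row per document, one column per topic
gamma = np.array([[1.2, 0.2, 0.2, 0.2],
                  [0.1, 2.1, 0.1, 0.1],
                  [0.5, 0.5, 0.5, 0.5]])

# gamma.sum(axis=1) gives one sum per row; [:, None] reshapes it to a column
# so that broadcasting divides each row by its own sum
topic_dist = gamma / gamma.sum(axis=1)[:, None]

row_sums = topic_dist.sum(axis=1)
print(row_sums)
```

Each row of topic_dist is then a probability distribution for one document. Writing gamma.sum(axis=1, keepdims=True) is an equivalent way to get the column of row sums.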

【Solution 2】:

Reading the source shows that topics whose probability is below a threshold are dropped. The default value of this threshold is 0.01.
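The effect of such a threshold can be illustrated in plain Python. This is a sketch, not gensim's actual code: filter_topics is a hypothetical helper, and the distribution is made up to resemble the 100-topic output in the question (one dominant topic, the rest sharing the remaining mass equally).

```python
def filter_topics(topic_dist, minimum_probability=0.01):
    """Keep only (topic_id, probability) pairs at or above the threshold."""
    return [(topic_id, prob)
            for topic_id, prob in enumerate(topic_dist)
            if prob >= minimum_probability]

# Made-up distribution over 100 topics: topic 73 dominates,
# the other 99 topics each hold 0.005 of the probability mass
num_topics = 100
topic_dist = [0.005] * num_topics
topic_dist[73] = 0.505

reported = filter_topics(topic_dist)
# Mass held by the topics that were filtered out
residual = 1.0 - sum(prob for _, prob in reported)
print(reported, residual)
```

Only topic 73 survives the filter, while the other 99 topics together still hold nearly half of the probability mass. This is why the report looks incomplete once the number of topics is large.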

【Comments】: