Visualizing topics with Spark LDA

【Question title】: Visualizing topics with Spark LDA
【Posted】: 2017-10-29 06:05:04
【Question】:

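I am using the pySpark ML LDA library to fit a topic model on the 20 newsgroups dataset from sklearn. I apply standard tokenization, stop-word removal, and a tf-idf transformation to the training corpus. In the end, I can get the topics and print out the term indices and their weights: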

topics = model.describeTopics()
topics.show()
+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[5456, 6894, 7878...|[0.03716766297248...|
|    1|[5179, 3810, 1545...|[0.12236370744240...|
|    2|[5653, 4248, 3655...|[1.90742686393836...|
...

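However, how do I map from the term indices to actual words in order to visualize the topics? I am using a HashingTF applied to a list of string tokens to derive the term indices. How do I generate a dictionary (a mapping from index to word) for visualizing the topics?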

【Comments】:

Tags: apache-spark lda apache-spark-ml

【Answer 1】:

An alternative to HashingTF is CountVectorizer, which generates a vocabulary:

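from pyspark.ml.feature import CountVectorizer

# Learn a vocabulary from the filtered tokens; vocab[i] is the word for term index i
count_vec = CountVectorizer(inputCol="tokens_filtered", outputCol="tf_features", vocabSize=num_features, minDF=2.0)
count_vec_model = count_vec.fit(newsgroups)
newsgroups = count_vec_model.transform(newsgroups)
vocab = count_vec_model.vocabulary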
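
To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch of how such an index-to-word list could be built. It is not Spark's implementation: `build_vocab` and the toy documents are hypothetical, and the alphabetical tie-breaking is a choice made only for this sketch. It mirrors the documented intent of `vocabSize` (keep the most frequent terms) and of an integer `minDF` (drop terms that appear in too few documents).

```python
from collections import Counter


def build_vocab(docs, vocab_size, min_df=1):
    """Return a list of terms; a term's position in the list is its index."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_counts = Counter()   # number of documents that contain each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    # Keep only terms that occur in at least min_df documents
    kept = [t for t in term_counts if doc_counts[t] >= min_df]
    # Most frequent first; ties broken alphabetically (assumption of this sketch)
    kept.sort(key=lambda t: (-term_counts[t], t))
    return kept[:vocab_size]


# Toy corpus, invented for illustration
docs = [
    ["space", "nasa", "orbit"],
    ["space", "shuttle", "nasa"],
    ["hockey", "team", "space"],
]
vocab = build_vocab(docs, vocab_size=10, min_df=2)
print(vocab)
```

Because the vocabulary is an ordinary list, looking up the word for a term index is just `vocab[index]`, which is exactly what a hashed index cannot offer.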

Given the vocabulary as a list of words, we can index into it to visualize the topics:

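# `model` must be an LDA model fit on features derived from count_vec_model,
# otherwise the term indices will not line up with `vocab`
topics = model.describeTopics()
topics_rdd = topics.rdd

# Replace each topic's term indices with the corresponding vocabulary words
topics_words = topics_rdd\
    .map(lambda row: row['termIndices'])\
    .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
    .collect()

for idx, topic in enumerate(topics_words):
    print("topic:", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")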
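
For visualization it is often handy to keep each word together with its weight. The following is a small sketch of that mapping on plain Python data; `topics_to_words` is a hypothetical helper, and the vocabulary and weights are invented toy values. With Spark, the rows would presumably come from `model.describeTopics().collect()`, whose rows can be indexed by column name in the same way.

```python
def topics_to_words(topic_rows, vocab):
    """Map each topic id to a list of (word, weight) pairs."""
    result = {}
    for row in topic_rows:
        pairs = [
            (vocab[index], weight)
            for index, weight in zip(row["termIndices"], row["termWeights"])
        ]
        result[row["topic"]] = pairs
    return result


# Toy stand-ins for the vocabulary and the describeTopics() rows
vocab = ["space", "nasa", "hockey", "team"]
topic_rows = [
    {"topic": 0, "termIndices": [0, 1], "termWeights": [0.6, 0.3]},
    {"topic": 1, "termIndices": [2, 3], "termWeights": [0.5, 0.4]},
]

labelled = topics_to_words(topic_rows, vocab)
for topic_id, pairs in labelled.items():
    # One line per topic, e.g. "word (weight), word (weight)"
    print(topic_id, ", ".join("%s (%.2f)" % (word, weight) for word, weight in pairs))
```

The output of `describeTopics()` has one row per topic, so collecting it to the driver before doing the lookup should be cheap.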

【Comments】:

A next step might be to feed the topic word lists, the documents, count_vec_model, and the vocabulary into Gensim's CoherenceModel to get a coherence score.
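
To give a rough idea of what a coherence score measures, here is a simplified UMass-style coherence computed in pure Python. It is not Gensim's implementation, and its numbers will not match Gensim's output; `umass_coherence` and the toy documents are hypothetical. It assumes every topic word occurs in at least one document and that the topic's words are ordered from most to least probable.

```python
import math


def umass_coherence(topic_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs."""
    doc_sets = [set(tokens) for tokens in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            w_m, w_l = topic_words[m], topic_words[l]
            score += math.log((doc_freq(w_m, w_l) + 1.0) / doc_freq(w_l))
    return score


# Toy corpus, invented for illustration
docs = [
    ["space", "nasa", "orbit"],
    ["space", "shuttle", "nasa"],
    ["hockey", "team", "space"],
]

related = umass_coherence(["space", "nasa"], docs)
unrelated = umass_coherence(["space", "hockey"], docs)
print(related, unrelated)
```

Words that tend to appear in the same documents score higher, so on this toy corpus the topic pairing "space" with "nasa" comes out ahead of the one pairing "space" with "hockey".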

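【Answer 2】:

HashingTF is irreversible: given the output index of a word, you cannot recover the input word, because multiple words may map to the same output index. You can use CountVectorizer instead, which is a similar but reversible process.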
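
The collision problem can be illustrated with a small sketch of the hashing trick. This is only an illustration, not HashingTF itself: `hashed_index` is a hypothetical helper, CRC32 stands in for the MurmurHash3 function that Spark uses, and the feature count is made deliberately tiny so that collisions are guaranteed.

```python
import zlib


def hashed_index(word, num_features):
    # CRC32 is a stand-in hash for this sketch; Spark's HashingTF uses MurmurHash3
    return zlib.crc32(word.encode("utf-8")) % num_features


words = ["space", "nasa", "orbit", "shuttle", "hockey", "team"]
num_features = 4  # fewer buckets than words, so some words must share an index

# Group words by the index they hash to
buckets = {}
for word in words:
    buckets.setdefault(hashed_index(word, num_features), []).append(word)

# Indices that more than one word maps to
collisions = {index: ws for index, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

Any index listed in `collisions` corresponds to several words, so there is no single word to show for it. A learned vocabulary avoids this because each index is assigned to exactly one term.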

【Comments】: