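【Posted】: 2019-01-12 15:30:57
【Question】:
I am using LDA to discover the topics of a set of texts. I managed to print the topics, but I would like to print each text together with its topic.
Data: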
it's very hot outside summer
there are not many flowers in winter
in the winter we eat hot food
in the summer we go to the sea
in winter we used many clothes
in summer we are on vacation
winter and summer are two seasons of the year
I tried sklearn and can print the topics, but I want to print all the phrases that belong to each topic:
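from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import pandas

# Load the 'comments' column as a plain list of strings
dataset = pandas.read_csv('data.csv', encoding='utf-8')
comments = dataset['comments']
comments_list = comments.values.tolist()
# Bag-of-words counts, one row per comment
vect = CountVectorizer()
X = vect.fit_transform(comments_list)
# n_components replaces n_topics, which newer scikit-learn releases no longer accept
lda = LatentDirichletAllocation(n_components=2, learning_method="batch", max_iter=25, random_state=0)
# document_topics has one row per document and one column per topic
document_topics = lda.fit_transform(X)
# For each topic, feature indices sorted by descending weight
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 onward
feature_names = np.array(vect.get_feature_names_out())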
# Rank documents by their weight for the second topic (column 1), highest first
docs = np.argsort(document_topics[:, 1])[::-1]
for i in docs[:4]:
    print(comments_list[i] + '\n')
Desired output:
Topic 1
it's very hot outside summer
in the summer we go to the sea
in summer we are on vacation
winter and summer are two seasons of the year
Topic 2
there are not many flowers in winter
in the winter we eat hot food
in winter we used many clothes
winter and summer are two seasons of the year
【Comments】:
-
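You already have the documents, and for each document there is a row in document_topics. So just iterate over your document_topics variable and use a dictionary to store the topics and indices.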
-
Thanks @Norhther, so I should do something like this: for i in document_topics?
-
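document_topics has a topic entry for each of your documents. So you can do it with a for loop, storing the index. A dictionary of lists does the job: the lists store the indices and the keys are the topics.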
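A minimal sketch of the dictionary-of-lists idea, assuming each document is assigned to the topic with the largest weight in its row. The document_topics values here are invented placeholders standing in for the output of lda.fit_transform(X), and group_by_dominant_topic is a hypothetical helper name, not part of scikit-learn.

```python
from collections import defaultdict

# Hypothetical stand-in for the array returned by lda.fit_transform(X):
# one row per document, one column per topic. Values are made up for illustration.
document_topics = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.85, 0.15],
]


def group_by_dominant_topic(doc_topic_rows):
    """Return {topic index: [document indices]} using each row's largest weight."""
    groups = defaultdict(list)
    for doc_idx, row in enumerate(doc_topic_rows):
        weights = list(row)
        # Index of the largest weight; ties go to the lowest topic index
        topic_idx = weights.index(max(weights))
        groups[topic_idx].append(doc_idx)
    return dict(groups)


topic_to_docs = group_by_dominant_topic(document_topics)
print(topic_to_docs)  # {0: [0, 3], 1: [1, 2]}
```

The same function should also accept the NumPy array from fit_transform directly, since it only iterates over rows.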
-
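Sorry, if I understood correctly, I can't do it that way. I want the output to show the documents as text along with their topics. If I do what you say, I will get the documents and their topics as numbers :(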
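One possible way to get text rather than numbers is to pair each row of topic weights with the original string while grouping. The sketch below is only an illustration under stated assumptions: the weights are invented placeholders for the output of lda.fit_transform(X), texts_by_topic is a hypothetical helper, and the 0.4 threshold is an arbitrary choice that lets a document appear under more than one topic, as the last sentence does in the desired output.

```python
# The seven sentences from the question
comments_list = [
    "it's very hot outside summer",
    "there are not many flowers in winter",
    "in the winter we eat hot food",
    "in the summer we go to the sea",
    "in winter we used many clothes",
    "in summer we are on vacation",
    "winter and summer are two seasons of the year",
]

# Hypothetical topic weights, one row per sentence (NOT real LDA output)
document_topics = [
    [0.9, 0.1],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.1, 0.9],
    [0.85, 0.15],
    [0.5, 0.5],
]


def texts_by_topic(texts, doc_topic_rows, threshold=0.4):
    """Map each topic index to the texts whose weight for it is >= threshold."""
    n_topics = len(doc_topic_rows[0]) if len(doc_topic_rows) else 0
    result = {t: [] for t in range(n_topics)}
    for text, row in zip(texts, doc_topic_rows):
        for topic_idx, weight in enumerate(row):
            if weight >= threshold:
                result[topic_idx].append(text)
    return result


grouped = texts_by_topic(comments_list, document_topics)
for topic_idx, topic_texts in grouped.items():
    print(f"Topic {topic_idx + 1}")
    for text in topic_texts:
        print(text)
```

With real LDA output the weights will differ, and which column corresponds to "summer" or "winter" is not guaranteed, so the threshold would need tuning against the actual document_topics values.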
Tags: python python-3.x scikit-learn lda topic-modeling