[Posted]: 2019-10-17 21:58:58
[Question]:
I'm using gensim LDA topic modeling to extract topics from a corpus. Now I want to get the top 20 documents representing each topic: the documents with the highest probability for that topic. I want to save them in a CSV file with this format: 4 columns for topic ID, topic terms, probability of each term in the topic, and the top 20 documents per topic.
I have tried get_document_topics, which I think is the best approach for this task:
all_topics = lda_model.get_document_topics(corpus, minimum_probability=0.0, per_word_topics=False)
But I'm not sure how to get the top 20 documents that best represent each topic and add them to the CSV file.
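One way to rank documents per topic is to invert the per-document output of `get_document_topics` into per-topic buckets and keep the highest-probability documents. This is a minimal sketch, not the asker's code: it uses a hard-coded toy list standing in for the real `all_topics` result, since the trained model isn't available here.

```python
import heapq

# Toy stand-in for lda_model.get_document_topics(corpus, minimum_probability=0.0):
# one list of (topic_id, probability) pairs per document.
all_topics = [
    [(0, 0.9), (1, 0.1)],   # doc 0
    [(0, 0.2), (1, 0.8)],   # doc 1
    [(0, 0.6), (1, 0.4)],   # doc 2
]

def top_docs_per_topic(doc_topics, num_topics, top_n=20):
    """Return {topic_id: [(doc_id, prob), ...]} sorted by descending probability."""
    buckets = {t: [] for t in range(num_topics)}
    for doc_id, topics in enumerate(doc_topics):
        for topic_id, prob in topics:
            buckets[topic_id].append((doc_id, prob))
    # Keep only the top_n documents per topic, highest probability first.
    return {t: heapq.nlargest(top_n, docs, key=lambda x: x[1])
            for t, docs in buckets.items()}

top_docs = top_docs_per_topic(all_topics, num_topics=2, top_n=20)
```

With the real model, `all_topics` would come from `lda_model.get_document_topics(corpus, minimum_probability=0.0)` and `top_n=20` would give the 20 most representative documents per topic.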
import gensim
from gensim import corpora
from pprint import pprint

data_words_nostops = remove_stopwords(processed_docs)
# Create Dictionary
id2word = corpora.Dictionary(data_words_nostops)
# Create Corpus
texts = data_words_nostops
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
pprint(lda_model.print_topics())
# save csv
import csv
import os

fn = "topic_terms5.csv"
if os.path.isfile(fn):
    m = "a"   # append if the file already exists
else:
    m = "w"   # otherwise create it and write the header
num_topics = 20
# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='') as csvfile:
    fieldnames = ["topic_id", "term", "prob"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    if m == "w":
        writer.writeheader()
    for topic_id in range(num_topics):
        term_probs = lda_model.show_topic(topic_id, topn=6)
        for term, prob in term_probs:
            row = {}
            row['topic_id'] = topic_id
            row['prob'] = prob
            row['term'] = term
            writer.writerow(row)
Expected result: a CSV file with this format: 4 columns for topic ID, topic terms, probability of each term, and the top 20 documents per topic.
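The expected 4-column layout can be produced by extending the question's `csv.DictWriter` fieldnames with a fourth column and repeating each topic's document list on every term row. A minimal sketch with hypothetical hard-coded inputs (`topic_terms` standing in for `lda_model.show_topic(...)` results, `top_docs` for a per-topic document ranking), writing to an in-memory buffer instead of a file:

```python
import csv
import io

# Hypothetical pre-computed inputs: per-topic term/probability pairs
# and per-topic (doc_id, probability) rankings.
topic_terms = {0: [("apple", 0.3), ("pear", 0.2)]}
top_docs = {0: [(5, 0.91), (2, 0.88)]}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["topic_id", "term", "prob", "top_docs"])
writer.writeheader()
for topic_id, terms in topic_terms.items():
    # Flatten the ranked document ids into one space-separated cell.
    doc_ids = " ".join(str(d) for d, _ in top_docs[topic_id])
    for term, prob in terms:
        writer.writerow({"topic_id": topic_id, "term": term,
                         "prob": prob, "top_docs": doc_ids})
csv_text = buf.getvalue()
```

Swapping `io.StringIO()` for the question's `open(fn, m, encoding="utf8", newline='')` writes the same rows to disk.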
[Discussion]:
Tags: python csv gensim lda topic-modeling