【发布时间】:2022-01-14 15:07:30
【问题描述】:
在使用 LDA 模型后,我正在尝试为我的数据集中的文档提取主题分数。具体来说,我遵循了这里的大部分代码:https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
我已经完成了主题模型并得到了我想要的结果,但是提供的代码只给出了每个文档的最主要主题。有没有一种简单的方法来修改以下代码,以便给我说 5 个最主要的主题的分数?
##dominant topic for each document
def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):
# Init output
sent_topics_df = pd.DataFrame()
# Get main topic in each document
for i, row in enumerate(ldamodel[corpus]):
row = sorted(row, key=lambda x: (x[1]), reverse=True)
# Get the Dominant topic, Perc Contribution and Keywords for each document
for j, (topic_num, prop_topic) in enumerate(row):
if j == 0: # => dominant topic
wp = ldamodel.show_topic(topic_num)
topic_keywords = ", ".join([word for word, prop in wp])
sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
else:
break
sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
# Add original text to the end of the output
contents = pd.Series(texts)
sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
return(sent_topics_df)
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
# Show
df_dominant_topic.head(10)
【问题讨论】:
标签: python gensim lda topic-modeling