How to remove a word in LDA analysis with gensim: answers

【Question title】: How to remove a word in LDA analysis by gensim
【Posted】: 2018-09-06 23:54:22
【Question description】:

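I am using gensim for LDA topic modeling. My data was preprocessed by someone else, who gave me two things: ① an mmcorpus file (loaded with gensim.corpora.MmCorpus) and ② a dictionary file (loaded with gensim.corpora.Dictionary.load). I successfully built LDA models, varying the hyperparameter alpha from 0.5 to 1.5, and plotted a visualization of the results. I am confused about why the plot contains several tall bars. I also found some strange words. Notably, the letter "b" appears, which I had never seen before. The person who gave me the data says that "b" may have been generated automatically when he converted the data to the bytes type. He does not know how to remove "b", and neither do I. How can I remove "b" when all I have is the mmcorpus file and the dictionary file? Please help!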

【Comments on the question】:

Tags: python text-mining gensim lda stop-words

【Solution 1】:

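gensim can filter specific tokens out of a dictionary; you only need to know their ids. As for the corpus, I am not aware of any built-in function that lets you modify its contents. However, you can convert the (usually sparse) corpus to a dense numpy array, delete the columns for the unwanted tokens, and convert the array back into a gensim corpus. After that, you should be able to train a new LDA model with the modified dictionary and corpus, this time without the unwanted word. Here is my attempt with a small toy corpus: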

import gensim
import numpy as np

# toy document set
texts = ['This is my first b', 'Another b just like so']
tokenlist = [list(gensim.utils.tokenize(text)) for text in texts]

# create dictionary and MmCorpus
dictionary = gensim.corpora.Dictionary(tokenlist)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenlist]
gensim.corpora.MmCorpus.serialize('MmCorpusTest.mm', corpus)

# assume the word 'b' is to be deleted, put its id in a variable
del_ids = [k for k,v in dictionary.items() if v=='b']

# remove unwanted word ids from the dictionary in place
dictionary.filter_tokens(bad_ids=del_ids)

# load corpus from your file
corpusMm = gensim.corpora.MmCorpus('MmCorpusTest.mm')
# convert corpus to a dense array, transpose because by default documents would be columns
np_corpus = gensim.matutils.corpus2dense(corpusMm, corpusMm.num_terms, num_docs=corpusMm.num_docs).T
# delete columns for specified tokens, transpose back afterwards
np_corpus = np.delete(np_corpus, del_ids, 1).T
# convert array to corpus
new_corpus = gensim.matutils.Dense2Corpus(np_corpus)
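
Converting to a dense array can use a lot of memory on a large corpus, because every document gets one cell per vocabulary term. As a possible alternative, the sketch below filters the bag-of-words documents directly. The helper name `drop_token_ids` is hypothetical and not part of gensim. It assumes each document is a list of (token_id, count) pairs, as gensim corpora yield them, and it shifts the remaining ids down so they stay contiguous. That is meant to line up with the ids the dictionary holds after `filter_tokens`, on the assumption that the dictionary renumbers surviving tokens in their original order; it is worth checking this on your gensim version.

```python
import bisect


def drop_token_ids(corpus, del_ids):
    """Yield bag-of-words documents with the given token ids removed.

    Hypothetical helper (not a gensim function). Remaining ids are
    shifted down so that they stay contiguous after the removal.
    """
    removed = sorted(set(del_ids))
    removed_set = set(removed)

    def new_id(old_id):
        # subtract the number of removed ids that are smaller than old_id
        return old_id - bisect.bisect_left(removed, old_id)

    for doc in corpus:
        yield [(new_id(tid), cnt) for tid, cnt in doc if tid not in removed_set]


# toy bag-of-words corpus; assume token id 1 is the unwanted 'b'
toy_corpus = [
    [(0, 1), (1, 2), (3, 1)],
    [(1, 1), (2, 1), (4, 3)],
]
filtered = list(drop_token_ids(toy_corpus, del_ids=[1]))
print(filtered)
```

With the files from the question, `corpus` could be the loaded MmCorpus object instead of the toy list, and the filtered documents could then be written back to disk with `gensim.corpora.MmCorpus.serialize`; that step is not shown here.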

【Discussion】: