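【Posted】: 2014-03-09 14:25:43
【Question】:
For each document in a list of 10,000 documents, I am trying to find the related documents from that same set of 10,000. I am testing two approaches: gensim LSI and gensim similarity. Both give terrible results. How can I improve this?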
from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import re

def cleanword(word):
    return re.sub(r'\W+', '', word).strip()

def create_corpus(documents):
    # remove common words and tokenize
    stoplist = stopwords.words('english')
    stoplist.append('')
    texts = [[cleanword(word) for word in document.lower().split() if cleanword(word) not in stoplist]
             for document in documents]
    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in tokens_once] for text in texts]
    dictionary = corpora.Dictionary(texts)
    corp = [dictionary.doc2bow(text) for text in texts]

def create_lsi(documents):
    corp = create_corpus(documents)
    # extract 400 LSI topics; use the default one-pass algorithm
    lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
    # print the most contributing words (both positively and negatively) for each of the first ten topics
    lsi.print_topics(10)

def create_sim_index(documents):
    corp = create_corpus(documents)
    index = similarities.Similarity('/tmp/tst', corp, num_features=12)
    return index
【Discussion】:
-
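First of all, you should not expect too much from purely unsupervised statistical methods such as LSI or LDA. Try tf-idf, cosine similarity, a stronger stopword list, or other clustering methods (e.g. k-means).
-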
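On the tf-idf and cosine similarity suggestion in the first comment: the sketch below shows, in plain Python, roughly what those two steps compute. It is only an illustration under simple assumptions (raw term counts times a natural-log idf, already tokenized input); the function names `tfidf_vectors` and `cosine` and the three toy documents are made up for this example, and gensim ships its own implementations (for example `models.TfidfModel`) that would normally be used instead.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Map each tokenized document to a sparse {term: tf * idf} dict."""
    n_docs = len(texts)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for text in texts for term in set(text))
    vectors = []
    for text in texts:
        counts = Counter(text)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, already tokenized documents (illustrative only).
texts = [
    ["graph", "minors", "survey"],
    ["graph", "minors", "trees"],
    ["human", "computer", "interface"],
]
vectors = tfidf_vectors(texts)
sim_graph_docs = cosine(vectors[0], vectors[1])  # two shared terms
sim_unrelated = cosine(vectors[0], vectors[2])   # no shared terms
```

Terms that occur in fewer documents get a larger weight, so a shared rare word counts for more than a shared common one when documents are ranked.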
No, the methods are fine. This is just a problem with copy-pasted code, @alvas :)
-
@Radim, can gensim be used together with Solr/ElasticSearch?