Posted: 2018-02-19 03:17:17
Question:
from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the code above, I use cosine similarity to compare how similar "This is a book about cars, dinosaurs, and fences" is to "I like cars and birds".
These two sentences share exactly one word, "cars", yet when I run the code they come out as 100% similar. That makes no sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
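(For comparison: the 1.0 score comes from the corpus containing only one document, so the 2-topic LSI space degenerates and every query projects onto the same direction. Below is a minimal sketch, not using gensim, that computes cosine similarity directly on raw bag-of-words term vectors; the tokenizer and helper names are illustrative, not from the question's code.)

```python
# Sketch: cosine similarity on raw term-frequency vectors.
# With a one-document training corpus, LSI collapses every
# query onto the same axis and always reports ~1.0; raw
# bag-of-words vectors reflect the actual word overlap.
from collections import Counter
import math

stoplist = set('for a of the and to in is'.split())

def tokenize(doc):
    """Lowercase, strip trailing commas, drop stopwords."""
    return [w.strip(',') for w in doc.lower().split()
            if w.strip(',') not in stoplist]

def cosine(doc1, doc2):
    """Cosine similarity of raw term-frequency vectors."""
    v1, v2 = Counter(tokenize(doc1)), Counter(tokenize(doc2))
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values())) *
            math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

sim = cosine("This is a book about cars, dinosaurs, and fences",
             "I like cars and birds")
print(round(sim, 3))  # only "cars" is shared -> about 0.204
```

With more documents in the corpus, the LSI projection stops being degenerate and `MatrixSimilarity` starts producing scores below 1.0 as well.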
Comments:
Tags: python gensim cosine-similarity