【Posted】: 2019-11-25 13:17:46
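【Question】:
The program should return the second text in the list as the most similar, since it matches the query word for word. That is not what happens.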
import gensim
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]
tagged_data=[TaggedDocument(word_tokenize(_d.lower()),tags=[str(i)]) for i,_d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                negative=0,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
loaded_model=Doc2Vec.load("d2v.model")
test_data=["I love coding in python".lower()]
v1=loaded_model.infer_vector(test_data)
similar_doc=loaded_model.docvecs.most_similar([v1])
print(similar_doc)
Output:
[('0', 0.17585766315460205), ('2', 0.055697083473205566), ('3', -0.02361609786748886), ('1', -0.2507985532283783)]
It ranks the first text in the list as the most similar instead of the second. Can you help?
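One thing worth checking in the code above is the shape of the input to `infer_vector`. gensim documents its first argument as a list of word tokens, and words missing from the model's vocabulary are ignored. The sketch below uses only plain Python to show how the posted input differs from a token list. `tokenize` is a hypothetical whitespace-splitting stand-in for `word_tokenize`, so its tokens will not match NLTK's exactly (punctuation stays attached to words).

```python
data = [
    "I love machine learning. Its awesome.",
    "I love coding in python",
    "I love building chatbots",
    "they chat amagingly well",
]


def tokenize(text):
    """Hypothetical stand-in for nltk's word_tokenize: lowercase, split on whitespace."""
    return text.lower().split()


# Every token the model would have seen while building its vocabulary.
vocabulary = {token for doc in data for token in tokenize(doc)}

# What the question passes to infer_vector: a list holding ONE long string.
as_posted = ["I love coding in python".lower()]
# A list of individual tokens, which is the shape infer_vector expects.
as_tokens = tokenize("I love coding in python")

# Keep only the items that would be recognised as vocabulary words.
known_as_posted = [t for t in as_posted if t in vocabulary]
known_as_tokens = [t for t in as_tokens if t in vocabulary]

print(known_as_posted)  # []
print(known_as_tokens)  # ['i', 'love', 'coding', 'in', 'python']
```

Under this assumption the posted call hands the model a single unknown token, so the inferred vector is unlikely to reflect any of the words in the sentence.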
【Discussion】:
Tags: python text-classification doc2vec