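【Posted】: 2021-03-23 21:11:17
【Question】:
I've started learning Doc2Vec, and in particular its cosine similarity output. When I try to match a new sentence against the list of sentences I trained the model on, I get an unexpected result. Any help would be appreciated. Here is my code: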
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize

# Training corpus: nine short sentences.
data = [
    'I love machine learning',
    'I love coding in python',
    'I love building chatbots',
    'they chat amazingly well',
    'dog poops in my yard',
    'this is a stupid exercise',
    'I like math and statistics',
    'cox communications is a dumb face',
    'Machine learning in python is difficult',
]

# Tag each lowercased, tokenized sentence with its list index as a string.
tagged_data = [TaggedDocument(words=word_tokenize(d.lower()), tags=[str(i)]) for i, d in enumerate(data)]

max_epochs = 15  # defined but never used below
vec_size = 10
wndw = 2
alpha_num = 0.025

# First model: the vocabulary is built, but the model is never trained.
model = Doc2Vec(vector_size=vec_size,
                window=wndw,
                alpha=alpha_num,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)

# Second model: replaces the first one. Passing the corpus to the
# constructor builds the vocabulary and trains in one step.
model = Doc2Vec(tagged_data, vector_size=20, window=2, min_count=1, workers=4, epochs=100)

# Infer a vector for an unseen sentence and rank the training documents
# by cosine similarity to it.
new_sent = 'machine learning in python is easy'.split(' ')
model.docvecs.most_similar(positive=[model.infer_vector(new_sent)])
The output I get looks like this (it also changes on every run, so I'm not sure what to make of it):
[('2', 0.4818369746208191),
('5', 0.4623863697052002),
('3', 0.4057881236076355),
('4', 0.3984462022781372),
('8', 0.2882154583930969),
('7', 0.27972114086151123),
('6', 0.23783418536186218),
('0', 0.11647315323352814),
('1', -0.12095103412866592)]
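In other words, the model says that 'I love coding in python' is the most similar to 'machine learning in python is easy', whereas I expected 'Machine learning in python is difficult' to be the most similar. At least, that's how I read it.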
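For reference, each pair in the output above is a document tag and the cosine similarity between that document's vector and the vector inferred for the new sentence. Here is a minimal sketch of that kind of ranking in plain Python. The helper names `cosine_similarity` and `rank_by_similarity` and the toy 2-d vectors are made up for illustration; this is not gensim's internal code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, doc_vectors):
    """Return (tag, score) pairs sorted from most to least similar."""
    scores = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-d document vectors, invented for illustration only.
doc_vectors = {
    '0': [1.0, 0.0],
    '1': [0.0, 1.0],
    '2': [-1.0, 0.0],
}
query = [2.0, 0.0]  # stands in for an inferred vector

ranking = rank_by_similarity(query, doc_vectors)
print(ranking)
```

Cosine similarity ignores vector length and ranges from -1 to 1, so a negative score means the two vectors are more than 90 degrees apart.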
【Discussion】: