Why is doc2vec giving different and unreliable results? (Answers)

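【Question title】: Why doc2vec is giving different and un-reliable results?
【Posted】: 2020-07-08 14:48:50
【Question】:

I have a set of 20 small documents that discuss one particular kind of problem (the training data). I now want to identify which documents, out of 10K, discuss the same problem.

For this purpose I am using the doc2vec implementation: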

import csv

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# tokenize_and_stem tokenizes and stems the text and returns the list of tokens
# documents_prb stores the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)]) for i, _d in enumerate(documents_prb)]
max_epochs = 20
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
    
def doc2vec_score(s):
    s_list = tokenize_and_stem(s)
    v1 = model.infer_vector(s_list)
    similar_doc = model.docvecs.most_similar([v1])
    original_match = (X[int(similar_doc[0][0])])
    score = similar_doc[0][1]
    match = similar_doc[0][0]
    return score,match


final_data  = []

# df_ws is the list of 10K docs for which i want to find the similarity with above 20 docs
for index, row in df_ws.iterrows():
    print(row['processed_description'])
    data = (doc2vec_score(row['processed_description']))
    L1=list(data)
    L1.append(row['Number'])
    final_data.append(L1)
     
with open('file_cosine_d2v.csv','w',newline='') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['score','match','INC_NUMBER'])
    for row in final_data:
        csv_out.writerow(row)

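However, I am running into a strange problem: the results are highly unreliable (a score of 0.9 even when there is not the slightest match), and the scores change by a large margin every time I run the doc2vec_score function. Can someone help me see what is wrong here?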

【Comments】:

Tags: machine-learning nlp gensim similarity doc2vec

【Answer 1】:

First, try not to use the anti-pattern of calling train multiple times in your own loop.

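If problems remain after that fix, please edit your question to show the corrected code, along with a clearer example of the output you consider unreliable.

For example, show the actual doc-IDs and scores, and explain why you believe the probe document you are testing should be "not at all a match" for any of the documents returned.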

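Note that if a document is truly nothing like the training documents, for example because it uses words that never appeared in them, a Doc2Vec model has no real way to detect that. When it infers a vector for a new document, all unknown words are ignored. You are therefore left with a document made up only of known words, and the model returns the best matches for that subset of the document's words.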
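One way to see how much of a probe document the model can actually use is to split its tokens into known and unknown words before inference. The sketch below is plain Python with hypothetical helper names and a toy vocabulary; with gensim 4.x, the set of known words could be taken from `model.wv.key_to_index` (an assumption about the installed version).

```python
def split_known_unknown(tokens, vocabulary):
    """Split tokens into words the model has seen and words it has not."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


def known_fraction(tokens, vocabulary):
    """Fraction of tokens that would survive inference (0.0 for empty input)."""
    if not tokens:
        return 0.0
    known, _ = split_known_unknown(tokens, vocabulary)
    return len(known) / len(tokens)


# Hypothetical vocabulary learned from the training documents.
vocabulary = {"printer", "driver", "update", "error"}

# A probe document about an unrelated topic that shares a single word.
probe = ["invoice", "payment", "overdue", "error"]

known, unknown = split_known_unknown(probe, vocabulary)
fraction = known_fraction(probe, vocabulary)
print(known, unknown, fraction)
```

A probe with a low known fraction is effectively being matched on only a handful of its words, so a high similarity score for it says little about the document as a whole.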

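More fundamentally, a Doc2Vec model only learns ways to contrast documents within the universe demonstrated by its training set, through the co-occurrences of their words. If it is presented with a document containing totally different words, or words whose frequencies/co-occurrences are unlike anything seen before, its output will be essentially random, with little meaningful relation to other, more typical documents. (The result may land close or far, because to some extent training on the "known universe" tends to fill the whole available space.)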

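So if you also want to recognize negative examples, you would not want to use a Doc2Vec model trained only on positive examples of what you want to recognize. Instead, include documents of all kinds, then remember the subset that is relevant to a particular in/out decision, and use that subset for downstream comparisons, or use multiple subsets to feed a more formal classification or clustering algorithm.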
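The idea of training on all kinds of documents and then remembering a positive subset can be sketched as a nearest-neighbor check. The code below is plain Python with made-up 2-dimensional vectors and hypothetical tag names; in practice the vectors would come from a Doc2Vec model trained on both the problem documents and a sample of everything else.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tag(query_vec, doc_vecs):
    """Tag of the training document most similar to the query vector."""
    return max(doc_vecs, key=lambda tag: cosine(query_vec, doc_vecs[tag]))


def looks_positive(query_vec, doc_vecs, positive_tags):
    """In/out decision: is the nearest training document a positive example?"""
    return nearest_tag(query_vec, doc_vecs) in positive_tags


# Toy vectors for a model trained on positive AND negative examples.
doc_vecs = {
    "p0": [1.0, 0.1],   # positive example
    "p1": [0.9, 0.2],   # positive example
    "n0": [0.0, 1.0],   # negative example
    "n1": [-0.2, 0.9],  # negative example
}
positive_tags = {"p0", "p1"}

print(looks_positive([0.8, 0.1], doc_vecs, positive_tags))
print(looks_positive([0.1, 1.0], doc_vecs, positive_tags))
```

Because negative examples have their own region of the space, an unrelated probe has something other than the positive documents to land near, which a model trained on the 20 positive documents alone cannot offer.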

【Comments】: