Doc2vec: How to get document vectors

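【Question title】: Doc2vec: How to get document vectors
【Posted】: 2019-06-04 18:17:50
【Question】:

How can I get document vectors for two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or recommend a tutorial.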

I am using gensim.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get

AttributeError: 'list' object has no attribute 'words'

whenever I run this.

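【Question comments】:

Tags: python gensim word2vec

【Answer 1】:

To train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain some additional information (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more).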

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

# Note: newer gensim releases name the dimensionality parameter vector_size instead of size
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors

model.docvecs[0]
model.docvecs[1]

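Update (how to train in epochs): that example became outdated, so I removed it. For more on training in epochs, see this answer or @gojomo's comment.

【Comments】:

I really like the idea of using namedtuple here, but what confuses me is whether there is a doc2. It looks like tags holds the id of a sentence, not of a document. The docs list looks as though it could hold multiple documents.
Actually, doc1 contains 2 different documents (not two sentences of one document). I don't know why @bee2502 named it doc1, but you can guess this from the line documents1=[doc.strip().split(" ") for doc in doc1 ]
@LenkaVraná Many thanks for the great answer :) Do we have to train our doc2vec model for several epochs? If so, how would we do that for the example above?
@Volka: tags is a list (a list of integers in this case, a list of strings in yours, but always a list).
Almost nobody should try to manage alpha themselves or call train() multiple times in their own loop. Instead, call train() once with the desired epochs parameter. It will smoothly manage the learning rate alpha, from its starting value to its final value, across all the repeated passes over the data.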
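
To illustrate the last comment, here is a minimal plain-Python sketch of a linearly decaying learning rate. It is not gensim's code: the function name linear_alpha, the start and end values, and the number of epochs are all assumptions chosen for illustration. It only shows why a single training call spanning every epoch gives a steady decline, while a hand-rolled loop of one-epoch calls starts each pass from the initial rate again.

```python
def linear_alpha(start_alpha, end_alpha, progress):
    """Learning rate after `progress` (0.0 to 1.0) of one training call."""
    alpha = start_alpha - (start_alpha - end_alpha) * progress
    return max(end_alpha, alpha)


# Illustrative values only (assumptions, not read from any model).
START_ALPHA, END_ALPHA, EPOCHS = 0.025, 0.0001, 5

# One call covering all epochs: the rate at the start of each pass keeps falling.
single_call = [
    linear_alpha(START_ALPHA, END_ALPHA, epoch / EPOCHS) for epoch in range(EPOCHS)
]

# Hand-rolled loop of one-epoch calls: every call begins at progress 0.0 again.
manual_loop = [linear_alpha(START_ALPHA, END_ALPHA, 0.0) for _ in range(EPOCHS)]

for epoch, (smooth, reset) in enumerate(zip(single_call, manual_loop)):
    print(f"epoch {epoch}: single call {smooth:.5f}, manual loop {reset:.5f}")
```

In the manual loop the rate would also have to be lowered by hand before every call, which is exactly the bookkeeping the comment advises against.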

【Answer 2】:

Gensim was updated. The syntax of LabeledSentence no longer contains labels. There are now tags - see the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99]

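It should be the value of the 100th vector of the trained model; it works with both integers and strings.

【Comments】:

Why did you pass 99?
In the example I wanted the 100th vector, so I asked for index 99, because indexing starts at 0.

【Answer 3】: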

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
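
The error itself can be reproduced without gensim. The sketch below is an illustration under assumptions: Tagged is a hypothetical stand-in for the library's tagged document type, and read_words is a hypothetical helper that mimics a trainer reading the words attribute from each document. A plain list of tokens has no such attribute, while the tagged form does.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's tagged document type (illustration only).
Tagged = namedtuple("Tagged", "words tags")


def read_words(document):
    """Mimic a trainer that expects every document to expose `.words`."""
    return document.words


plain = "This is a sentence".split(" ")
tagged = Tagged(words=plain, tags=[0])

try:
    read_words(plain)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)       # the same kind of complaint as in the question
print(read_words(tagged))  # the tagged form exposes its tokens
```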

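More details here: http://rare-technologies.com/doc2vec-tutorial/ However, I solved the problem by using TaggedLineDocument() to read the input data from a file.
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.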

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences, vector_size = 100, window = 300, min_count = 10, workers=4)
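
To make the one-line-per-document file format concrete, here is a plain-Python imitation of it. This is a sketch, not gensim's implementation: LineDocument and read_line_documents are hypothetical names, and the two-line corpus is made up for the example. Each line becomes one document whose words are the whitespace-separated tokens and whose single tag is the line number, counted from 0.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for a tagged document (illustration only).
LineDocument = namedtuple("LineDocument", "words tags")


def read_line_documents(path):
    """Yield one document per line: whitespace-split words, tagged with the line number."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield LineDocument(words=line.split(), tags=[line_no])


# Write a tiny two-document corpus to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("this is a sentence\nthis is another sentence\n")
    corpus_path = tmp.name

line_docs = list(read_line_documents(corpus_path))
os.remove(corpus_path)

for doc in line_docs:
    print(doc.tags, doc.words)
```

Because the tags are integers derived from line numbers, the vector for a given line is later looked up by that integer.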

Getting document vectors: you can use docvecs. More details: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99]

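Here 99 is the document id of the vector we want. If the tags are integers (the default when you load the data with TaggedLineDocument()), use the integer id directly, as I did. If the tags are strings, use "SENT_99". This is similar to Word2vec.

【Comments】:

Just to confirm: after training model_dm and model_dbow as shown in the tutorial (linanqiu.github.io/2015/05/20/word2vec-sentiment), I am using model_dm.docvecs['TRAIN_0'] to get the document vector of the first training document. Is that correct?
Yes, that is correct. You can then compare multiple documents with a distance function and so on.
I have more than 5m training documents, but when I use docvec = model.docvecs[11], it says that index 11 is out of bounds for axis 0 with size 10. I checked the size of docvecs and it is only 10, when it should be more than 5m.
@Kun Old thread, but I ran into the same problem. The solution is to pass a list when creating the TaggedDocument, for example TaggedDocument(words, ["label_1"]); otherwise every letter is treated as a tag.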
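
The pitfall in the last two comments can be shown in plain Python. The sketch below assumes the tags were passed as bare numeric strings, which is one plausible reading of the size-10 report rather than something the thread confirms. collect_tags is a hypothetical helper that gathers tags by iterating over each document's tags, and the corpus is a placeholder.

```python
def collect_tags(tagged_docs):
    """Gather the distinct tags by iterating over each document's tags."""
    seen = set()
    for _words, tags in tagged_docs:
        for tag in tags:  # iterating over a str yields single characters
            seen.add(tag)
    return seen


corpus = [["some", "words"]] * 1000  # placeholder documents

# Pitfall: each tag is a bare string such as "137".
wrong = [(words, str(i)) for i, words in enumerate(corpus)]
# Fix: each tag is wrapped in a list such as ["137"].
right = [(words, [str(i)]) for i, words in enumerate(corpus)]

print(len(collect_tags(wrong)))  # only the digit characters survive
print(len(collect_tags(right)))  # one tag per document
```

With bare strings, the distinct tags collapse to the ten digit characters however many documents there are, which would match a document-vector table of size 10.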

【Answer 4】:

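from gensim.models.doc2vec import Doc2Vec, TaggedDocument
doc1 = ["This is a sentence", "This is another sentence"]
# words must be a list of tokens and tags must be a list
documents = [TaggedDocument(doc.lower().split(), [i]) for i, doc in enumerate(doc1)]
# Illustrative parameters; min_count=1 keeps the words of this tiny corpus from being dropped
model = Doc2Vec(documents, vector_size=100, min_count=1, workers=4)

This should work fine. You need to tag your documents in order to train a doc2vec model.

【Comments】: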