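Word embedding does not provide expected relations between words

【Question title】: Word-embedding does not provide expected relations between words
【Posted】: 2021-07-21 07:16:32
【Question description】:

I am trying to train a word embedding on a list of repeated sentences in which only the subject changes. I expected the vectors generated for the subjects to be strongly correlated after training, as one would expect from a word embedding. However, the similarity between subject vectors (the cosine of the angle between them) is not always larger than the similarity between a subject and a random word.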

Man   is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy   is going to write a very long novel that no one can read.

The code is based on the PyTorch tutorial:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class EmbedTrainer(nn.Module):
    def __init__(self, d_vocab, d_embed, d_context):
        super(EmbedTrainer, self).__init__()
        self.embed = nn.Embedding(d_vocab, d_embed)
        self.fc_1 = nn.Linear(d_embed * d_context, 128)
        self.fc_2 = nn.Linear(128, d_vocab)

    def forward(self, x):
        x = self.embed(x).view((1, -1)) # flatten after embedding
        x = self.fc_2(F.relu(self.fc_1(x)))
        x = F.log_softmax(x, dim=1)
        return x

text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2

""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

losses = []
epochs = 10
for epoch in range(epochs):
    total_loss = 0
    for input, target in trigrams:
        tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
        target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
        model.zero_grad()
        log_prob = model(tok_ids)
        #if total_loss == 0: print("train ", log_prob, target_id)
        loss = loss_func(log_prob, target_id)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(total_loss)
    losses.append(total_loss)

embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
    embed_map[word] = model.embed.weight[tok_to_ids[word]]
    print(word, embed_map[word])

def angle(a, b):
    from numpy.linalg import norm
    a, b = a.detach().numpy(), b.detach().numpy()
    return np.dot(a, b) / norm(a) / norm(b)

print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
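Note that the angle helper above returns the cosine of the angle (cosine similarity), not the angle itself, so larger values mean more similar vectors. A minimal sketch that reports both quantities, using plain Python lists; the helper names cosine_similarity and angle_degrees are hypothetical and not part of the original code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between two vectors in degrees, in the range [0, 180]."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(angle_degrees([1.0, 0.0], [0.0, 1.0]))
```

With the tensors from the code above, the vectors could be passed as embed_map["Man"].detach().tolist().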

【Comments】:

Probably not enough data. What is your training set size? If you are training 128d embeddings on 3 sentences, this is not unexpected.

Tags: python nlp pytorch word-embedding

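【Solution 1】:

It is most likely the training size. Training 128d embeddings is definitely overkill. A rule of thumb from the Google Developers blog:

Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions: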
embedding_dimensions = number_of_categories**0.25

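That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3: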
3 = 81**0.25
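As a quick check of the rule of thumb quoted above, here is a minimal sketch; the function name suggested_embedding_dim is hypothetical, and rounding to the nearest integer is an assumption, since the quoted formula does not say how to round. The three sentences in the question contain 16 distinct whitespace-separated tokens, so the rule would suggest only 2 dimensions there.

```python
def suggested_embedding_dim(num_categories):
    """Rule of thumb: fourth root of the number of categories, rounded."""
    return max(1, round(num_categories ** 0.25))

print(suggested_embedding_dim(81))  # 3, the blog's example
print(suggested_embedding_dim(16))  # 2, the vocabulary size in the question
```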

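【Discussion】:

Thanks, that is a nice equation. I remember also trying an embedding size of 3, but I did not see much difference. The loss decreased and I hoped to see some correlation, but there was nothing. Multiple runs gave different results with no significant correlation.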

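【Solution 2】:

I expected the vectors generated for the subjects to be strongly correlated after training, as one would expect from a word embedding

I really don't think you will achieve that kind of result with only 3 sentences and about 40 iterations in each of 10 epochs (plus, most of the data in those 40 iterations is repeated).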
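To make the point about repeated data concrete, the sketch below rebuilds the trigram list the same way the question does and counts how many training examples are distinct and how many have a subject word in their two-word context. Only the latter contribute gradient to the subject embeddings. The variable names are illustrative and not taken from the original code.

```python
from collections import Counter

subjects = ["Man", "Woman", "Boy"]
template = "{} is going to write a very long novel that no one can read."
tokens = " ".join(template.format(s) for s in subjects).split()

# Same construction as in the question: two context words, one target word.
trigrams = [((tokens[i], tokens[i + 1]), tokens[i + 2])
            for i in range(len(tokens) - 2)]
distinct = Counter(trigrams)
subject_in_context = [t for t in trigrams if set(t[0]) & set(subjects)]

print(len(trigrams))            # 40 training examples per epoch
print(len(distinct))            # 18 distinct examples
print(len(subject_in_context))  # 5 examples that update a subject embedding
```

Note also that the context (can, read.) appears with two different targets, Woman and Boy, so even the few subject-related examples partly conflict with each other.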

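Maybe try downloading a few of the free datasets available online, or try your own data with a proven model (for example, a gensim model).

Below is code for training a gensim model, so you can test your dataset on another model and see whether the problem comes from your data or from your model. I have tested a similar gensim model on a dataset containing millions of sentences and it worked very well; for smaller datasets you may need to change the parameters.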

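from gensim.models import Word2Vec
from multiprocessing import cpu_count

corpus_path = 'eachLineASentence.txt'  # one whitespace-tokenized sentence per line
vecSize = 300
winSize = 5
numWorkers = max(1, cpu_count() - 1)   # keep at least one worker thread
epochs = 20
minCount = 5                           # ignore words seen fewer than 5 times
skipGram = 0                           # 0 = CBOW, 1 = skip-gram
modelName = 'mymodel.model'

# Parameter names follow gensim >= 4.0; in gensim 3.x they were `size` and `iter`.
model = Word2Vec(corpus_file=corpus_path,
                vector_size=vecSize,
                window=winSize,
                min_count=minCount,
                workers=numWorkers,
                epochs=epochs,
                sg=skipGram)
model.save(modelName)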

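P.S. I don't think it is a good idea to use input as a variable name in your code, since it is the name of a Python built-in function.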

【Discussion】: