【Question Title】: Retraining pre-trained word embeddings in Python using Gensim
【Posted】: 2019-10-27 23:46:15
【Question Description】:

I want to retrain pre-trained word embeddings in Python using Gensim. The pre-trained embeddings I want to use are Google's Word2Vec vectors from the GoogleNews-vectors-negative300.bin file.

Following Gensim's word2vec tutorial, "it is impossible to resume training with models generated by the C tool load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there." Therefore I cannot use KeyedVectors, and for model training the tutorial suggests:

    model = gensim.models.Word2Vec.load('/tmp/mymodel')
    model.train(more_sentences)

(https://rare-technologies.com/word2vec-tutorial/)

However, when I try this:

    from gensim.models import Word2Vec
    model = Word2Vec.load('data/GoogleNews-vectors-negative300.bin')

I get an error message:

    1330         # Because of loading from S3 load can't be used (missing readline in smart_open)
    1331         if sys.version_info > (3, 0):
    -> 1332             return _pickle.load(f, encoding='latin1')
    1333         else:
    1334             return _pickle.loads(f.read())

    UnpicklingError: invalid load key, '3'.

I have found no way to properly convert the binary Google News file to a text file, and even then I am not sure whether that would solve my problem.

Does anyone have a solution to this problem, or know of a different way to retrain pre-trained word embeddings?

【Question Discussion】:

    Tags: python-3.x gensim word2vec


    【Solution 1】:

    The Word2Vec.load() method can only load full models in gensim's native format (based on Python object-pickling), not vectors in any other binary/text format.

    Also, per the note in the documentation that "it is impossible to resume training with models generated by the C tool", there is not enough information in the GoogleNews raw-vectors file to reconstruct the full working model that was used to train them. (That would require some internal model weights, which are not saved in that file, as well as the word-frequency information used to control sampling, which is also not saved in that file.)

    The best you could do is create a new Word2Vec model, then patch some or all of the GoogleNews vectors into it before doing your own training. This is an error-prone process with no real best practices, and many caveats about the interpretation of the final results. (For example, if you bring in all the vectors but then retrain only a subset using your own corpus and word frequencies, the more training you do to make the word vectors better fit your corpus, the less usefully comparable the retrained words will be to the untouched, non-retrained words.)

    Essentially, if you can study the gensim Word2Vec source and work out how to stitch such a Frankenstein model together, it may be suitable. But there is no built-in support or handy off-the-shelf recipe that makes it easy, because it is an inherently murky process.

    【Discussion】:

      【Solution 2】:

      I have already answered this here.

      Use gensim to save the Google News model as a text file in word2vec format.

      Refer to this answer to save it as a text file.
      Then try this code.

      import os
      import pickle
      import numpy as np
      import gensim
      from gensim.models import Word2Vec, KeyedVectors
      from gensim.models.callbacks import CallbackAny2Vec
      import operator
      
      os.mkdir("model_dir")
      
      # class EpochSaver(CallbackAny2Vec):
      #     '''Callback to save model after each epoch.'''
      #     def __init__(self, path_prefix):
      #         self.path_prefix = path_prefix
      #         self.epoch = 0
      
      #     def on_epoch_end(self, model):
      #         list_of_existing_files = os.listdir(".")
      #         output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
      #         try:
      #             model.save(output_path)
      #         except:
      #             model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
      #         print("number of epochs completed = {}".format(self.epoch))
      #         self.epoch += 1
      #         list_of_total_files = os.listdir(".")
      
      # saver = EpochSaver("my_finetuned")
      
      
      
      
      
      # Function to load vectors from an existing model.
      # I am loading glove vectors from a text file; a benefit of doing this is that I also get the complete glove vocab.
      # If you are using a previous word2vec model, I would recommend saving it in txt format.
      # If you decide not to, you can tweak the function to get vectors only for the words in your vocab.
      def load_vectors(token2id, path,  limit=None):
          embed_shape = (len(token2id), 300)
          vectors = np.zeros(embed_shape, dtype='f')
          i = 0
          with open(path, encoding="utf8", errors='ignore') as f:
              for o in f:
                  token, *vector = o.split(' ')
                  token = str.lower(token)
                  if len(o) <= 100:
                      continue
                  if limit is not None and i > limit:
                      break
                  vectors[token2id[token]] = np.array(vector, 'f')
                  i += 1
      
          return vectors
      
      
      # path of text file of your word vectors.
      embedding_name = "word2vec.txt"
      data = "<training data (newline-separated text file)>"
      
      # Dictionary to store a unique id for each token in vocab( in my case vocab contains both my vocab and glove vocab)
      token2id = {}
      
      # This dictionary will contain all the words and their frequencies.
      vocab_freq_dict = {}
      
      # Populating vocab_freq_dict and token2id from my data.
      id_ = 0
      training_examples = []
      file = open("{}".format(data),'r', encoding="utf-8")
      for line in file.readlines():
          words = line.strip().split(" ")
          training_examples.append(words)
          for word in words:
              if word not in vocab_freq_dict:
                  vocab_freq_dict.update({word:0})
              vocab_freq_dict[word] += 1
              if word not in token2id:
                  token2id.update({word:id_})
                  id_ += 1
      
      # Populating vocab_freq_dict and token2id from glove vocab.
      max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
      max_token_id = token2id[max_id]
      with open(embedding_name, encoding="utf8", errors='ignore') as f:
          for o in f:
              token, *vector = o.split(' ')
              token = str.lower(token)
              if len(o) <= 100:
                  continue
              if token not in token2id:
                  max_token_id += 1
                  token2id.update({token:max_token_id})
                  vocab_freq_dict.update({token:1})
      
      with open("vocab_freq_dict","wb") as vocab_file:
          pickle.dump(vocab_freq_dict, vocab_file)
      with open("token2id", "wb") as token2id_file:
          pickle.dump(token2id, token2id_file)
      
      
      
      # converting vectors to keyedvectors format for gensim
      vectors = load_vectors(token2id, embedding_name)
      vec = KeyedVectors(300)
      vec.add(list(token2id.keys()), vectors, replace=True)
      
      # setting vectors(numpy_array) to None to release memory
      vectors = None
      
      params = dict(min_count=1,workers=14,iter=6,size=300)
      
      model = Word2Vec(**params)
      
      # using build from vocab to build the vocab
      model.build_vocab_from_freq(vocab_freq_dict)
      
      # using token2id to create idxmap
      idxmap = np.array([token2id[w] for w in model.wv.index2entity])
      
      # Setting the input->hidden weights (syn0) to your vectors, arranged according to ids
      model.wv.vectors[:] = vec.vectors[idxmap]
      
      # Setting the hidden->output weights (syn1neg) to your vectors, arranged according to ids
      model.trainables.syn1neg[:] = vec.vectors[idxmap]
      
      
      model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
      output_path = 'model_dir/final_model.model'
      model.save(output_path)
      
      

      【Discussion】:
