【Posted】: 2019-05-23 15:14:11
【Problem description】:
EDIT
The training corpus is a Spark DataFrame that I built before this step. I load it from Parquet format and created a `Feed` class that gives the gensim library an iterator over the training corpus:
class Feed():
    def __init__(self, train_data):
        self.train_data = train_data

    def __iter__(self):
        # stream rows from the Spark DataFrame to the driver one at a time
        for row in self.train_data.rdd.toLocalIterator():
            yield gensim.models.doc2vec.TaggedDocument(
                words=[kw.lower() for kw in row["keywords"]] + list(row["tokens_filtered"]),
                tags=[row["id"]])
sdf = spark.read.parquet(save_dirname)
train_corpus = Feed(sdf)
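One detail worth noting about this pattern: gensim's `train()` makes one full pass over the corpus per epoch, so the corpus must be re-iterable, which a class with `__iter__` (like `Feed`) provides and a plain generator does not. A minimal stand-alone sketch of that pattern (no Spark, no gensim; the list of dicts and the `tokens` key are hypothetical stand-ins for the DataFrame rows):

```python
# Sketch of the restartable-iterable pattern: each call to __iter__
# starts a fresh pass over the data, which multi-epoch training needs.
# A plain generator would be exhausted after the first pass.
class Feed:
    def __init__(self, rows):
        self.rows = rows  # stands in for the Spark DataFrame

    def __iter__(self):
        for row in self.rows:  # stands in for rdd.toLocalIterator()
            yield [w.lower() for w in row["keywords"]] + list(row["tokens"])

corpus = Feed([{"keywords": ["AI"], "tokens": ["news", "text"]},
               {"keywords": ["ML"], "tokens": ["more", "text"]}])

first_pass = list(corpus)
second_pass = list(corpus)  # restarts from the beginning, same content
```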
END EDIT
I want to train a gensim Doc2Vec model on about 9 million news text documents. Here is my model definition:
model = gensim.models.doc2vec.Doc2Vec(
    workers=8,
    vector_size=300,
    min_count=50,
    epochs=10)
The first step is building the vocabulary:
model.build_vocab(train_corpus)
It finishes after 90 minutes. Here is the logging output at the end of this step:
INFO:gensim.models.doc2vec:collected 4202859 word types and 8950263 unique tags from a corpus of 8950339 examples and 1565845381 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=50 retains 325027 unique words (7% of original 4202859, drops 3877832)
INFO:gensim.models.word2vec:min_count=50 leaves 1546772183 word corpus (98% of original 1565845381, drops 19073198)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 4202859 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 9 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 1536820314 word corpus (99.4% of prior 1546772183)
INFO:gensim.models.base_any2vec:estimated required memory for 325027 words and 300 dimensions: 13472946500 bytes
Then I train the model on the training corpus, using the iterator class:
model.train(train_corpus, total_examples=nb_rows, epochs=model.epochs)
The last training log lines are:
INFO:gensim.models.base_any2vec:EPOCH 1 - PROGRESS: at 99.99% examples, 201921 words/s, in_qsize 16, out_qsize 0
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 7 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 6 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 5 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 4 more threads
But it never finishes the remaining threads. This is not the first time I have hit this problem, even with much smaller training corpora. Usually I restart the whole process (vocabulary build plus model training) and it eventually goes through.
This time, to save time, I do not want to compute the vocabulary again: I would like to put the previously, successfully computed vocabulary in place and only retry the training. Is there a way to save just the vocabulary part of the model, then load it and train the model directly on the training corpus?
Tags: python pyspark gensim doc2vec