doc2vec/gensim - issue with shuffling sentences in the epochs (answers)

【Question title】: doc2vec/gensim - issue with shuffling sentences in the epochs
【Posted】: 2023-10-24 07:42:01
【Question description】:

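I'm trying to get started with word2vec and doc2vec using the excellent tutorials here and here, and I'm experimenting with the code samples. The only thing I added is a clean_line() method that removes punctuation, stopwords, and so on.

But I'm running into trouble with the clean_line() method that gets called during the training iterations. I understand that the call to the global method is messing things up, but I'm not sure how to fix it.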

Iteration 1
Traceback (most recent call last):
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 96, in <module>
    train()
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 91, in train
    model.train(sentences.sentences_perm(),total_examples=model.corpus_count,epochs=model.iter)
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 61, in sentences_perm
    shuffled = list(self.sentences)
AttributeError: 'TaggedLineSentence' object has no attribute 'sentences'

My code is as follows:

import gensim
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
import os
import random
import numpy
from sklearn.linear_model import LogisticRegression
import logging
import sys
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))


def clean_line(line):
    new_str = unicode(line, errors='replace').lower() #encoding issues
    dlist = tokenizer.tokenize(new_str)
    dlist = list(set(dlist).difference(stopword_set))
    new_line = ' '.join(dlist)
    return new_line


class TaggedLineSentence(object):
    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no]))
        return(self.sentences)

    def sentences_perm(self):
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return(shuffled)


def train():
    #create a list data that stores the content of all text files in order of their names in docLabels
    doc_files = [f for f in os.listdir('./data/') if f.endswith('.csv')]

    sources = {}
    for doc in doc_files:
        doc2 = os.path.join('./data',doc)
        sources[doc2] = doc.replace('.csv','')

    sentences = TaggedLineSentence(sources)


    # #iterator returned over all documents
    model = gensim.models.Doc2Vec(size=300, min_count=2, alpha=0.025, min_alpha=0.025)
    model.build_vocab(sentences)

    #training of model
    for epoch in range(10):
        #random.shuffle(sentences)
        print 'iteration '+str(epoch+1)
        #model.train(it)
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        model.train(sentences.sentences_perm(),total_examples=model.corpus_count,epochs=model.iter)
    #saving the created model
    model.save('reddit.doc2vec')
    print "model saved" 

train()

【Question discussion】:

Tags: python word2vec gensim doc2vec

【Solution 1】:

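These are not great tutorials for recent versions of gensim. In particular, calling train() multiple times in a loop while managing alpha/min_alpha yourself is a bad idea. It is easy to get wrong (your code, for example, does the wrong thing) and offers no benefit to most users. Leave min_alpha at its default and call train() only once; it will then run exactly epochs iterations, properly decaying the learning rate alpha from its maximum to its minimum value.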
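
To make the difference concrete, here is a minimal pure-Python sketch of the two learning-rate schedules. It does not call gensim; the function names are made up for illustration, the 0.0001 floor is gensim's documented default for min_alpha, and the per-epoch linear steps are only an approximation of how the rate is decayed progressively inside a single train() call.

```python
def manual_schedule(start_alpha=0.025, step=0.002, loops=10):
    """Rates produced by the question's loop.

    Each pass lowers alpha by `step` and then sets min_alpha = alpha,
    so the rate stays constant for the whole of that train() call.
    """
    rates = []
    alpha = start_alpha
    for _ in range(loops):
        alpha -= step
        rates.append(round(alpha, 6))
    return rates


def linear_schedule(start_alpha=0.025, end_alpha=0.0001, epochs=10):
    """Approximate starting rate of each epoch in one train() call.

    Assumes a straight-line decay from start_alpha toward end_alpha.
    """
    span = start_alpha - end_alpha
    return [round(start_alpha - span * e / epochs, 6) for e in range(epochs)]


manual = manual_schedule()
linear = linear_schedule()
print("manual loop :", manual)
print("single call :", linear)
```

Note also that every train() call in the question's loop runs model.iter passes over the data, so the loop makes ten times as many passes as the loop counter suggests, and it never lets the rate fall below 0.005.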

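Your specific error occurs because your TaggedLineSentence class has no sentences attribute, at least not until to_array() has been called, yet the code tries to access that nonexistent attribute.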
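
The failure can be reproduced without gensim. The sketch below uses a hypothetical stand-in class, LazyCorpus, that mirrors the structure of TaggedLineSentence: the sentences attribute is only created inside to_array(), so calling sentences_perm() first raises the same AttributeError.

```python
import random


class LazyCorpus(object):
    """Hypothetical stand-in for TaggedLineSentence (not the asker's class)."""

    def __init__(self, lines):
        self.lines = lines  # note: self.sentences is NOT created here

    def to_array(self):
        # The attribute only comes into existence when this method runs.
        self.sentences = [line.split() for line in self.lines]
        return self.sentences

    def sentences_perm(self):
        shuffled = list(self.sentences)  # fails if to_array() was never called
        random.shuffle(shuffled)
        return shuffled


corpus = LazyCorpus(["first doc", "second doc"])

try:
    corpus.sentences_perm()
    raised = False
except AttributeError as err:
    raised = True
    print("before to_array():", err)

corpus.to_array()
perm = corpus.sentences_perm()
print("after to_array():", perm)
```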

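The whole to_array()/sentences_perm() approach is a bit broken. The usual reason for an iterable class like this is to keep a large dataset out of main memory by streaming it from disk. But to_array() then simply loads everything and caches it inside the class, which defeats the benefit of the iterable. If you can afford that, because the full dataset easily fits in memory, you can just do...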

sentences = list(TaggedLineSentence(sources))

...to iterate over the data on disk once and then keep the corpus in an in-memory list.
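
As a rough illustration of that trade-off, the sketch below uses a hypothetical LineCorpus class (a simplified stand-in for TaggedLineSentence that yields plain token lists rather than TaggedDocument objects) and a throwaway temp file. It counts how often the file is actually read: once for list(), and not at all for later passes over the list.

```python
import os
import tempfile


class LineCorpus(object):
    """Hypothetical streaming corpus: re-reads the file on every iteration."""

    def __init__(self, path):
        self.path = path
        self.passes = 0  # how many times the file has been opened

    def __iter__(self):
        self.passes += 1
        with open(self.path) as fin:
            for line in fin:
                yield line.split()


# Write a tiny two-line corpus to a temporary file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as fout:
    fout.write("first document here\nsecond document here\n")

streamed = LineCorpus(path)
in_memory = list(streamed)  # a single pass over the file

# Later passes (as a vocabulary scan and training would make) use the list only.
token_total = 0
for _ in range(3):
    for tokens in in_memory:
        token_total += len(tokens)

os.remove(path)
print("file passes:", streamed.passes, "tokens seen:", token_total)
```

If the corpus does not fit in memory, passing the streaming object itself keeps memory use low at the cost of re-reading the file on every pass.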

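Reshuffling repeatedly during training is usually unnecessary. Only when the training data has some existing clumping, such as all examples with certain words/topics stuck together at the top or bottom of the ordering, is the native ordering likely to cause training problems. In that case, a single shuffle before any training should be enough to remove the clumping. So, again assuming your data fits in memory, you can do...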

sentences = list(TaggedLineSentence(sources))
random.shuffle(sentences)  # shuffles in place and returns None

...once; you then have a sentences list that you can pass to Doc2Vec in both build_vocab() and train() (called only once) below.
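
Putting the pieces together, here is a hedged sketch rather than a definitive implementation. shuffle_once() and train_once() are hypothetical helper names. train_once() assumes gensim 4.x parameter names (vector_size, epochs); older releases use size and iter, as in the question. The values 300, 2 and 10 are taken from the question's code. random.shuffle() works in place and returns None, so its result must not be assigned to sentences.

```python
import random


def shuffle_once(docs, seed=None):
    """Return a shuffled copy of docs, leaving the input list untouched."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # in place; shuffle() returns None
    return shuffled


def train_once(tagged_docs):
    """Build the vocabulary, then make a single train() call.

    Assumes gensim 4.x is installed and tagged_docs is a list of
    TaggedDocument objects; not executed in the demo below.
    """
    from gensim.models import Doc2Vec  # lazy import: only needed when called

    model = Doc2Vec(vector_size=300, min_count=2, epochs=10)
    model.build_vocab(tagged_docs)
    model.train(tagged_docs,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    return model


# Demo of the shuffling part on a deliberately clumped toy list.
clustered = ["sports"] * 3 + ["politics"] * 3
mixed = shuffle_once(clustered, seed=0)
returned = random.shuffle(list(clustered))  # the pitfall: this is None
print(mixed, returned)
```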

【Discussion】: