Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

【Question title】: Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'
【Posted】: 2016-12-19 13:07:29
【Question description】:

I am learning the Doc2Vec model from the gensim library and using it as follows:

import os

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from nltk.corpus import stopwords

class MyTaggedDocument(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Walk every file in the directory; each line becomes one document.
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                print(fname)
                for item_no, sentence in enumerate(fin):
                    # Tag each line as '<file name without extension>_<line number>'.
                    yield LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no])

sentences = MyTaggedDocument(dirname)
model = Doc2Vec(sentences, min_count=2, window=10, size=300, sample=1e-4, negative=5, workers=7)

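The input dirname is a directory path; for simplicity it contains only 2 files, each with more than 100 lines. I am getting the following exception: AttributeError: 'str' object has no attribute 'words'.

Also, from the print statement I can see that the iterator goes over the directory 6 times. Why is that?

Any help would be appreciated.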

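【Question discussion】:

One thing: don't you want `if w not in` the stopwords? Right now your sentences contain only stopwords.
Yes, that was a mistake. I have corrected it, but the same problem remains.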
medium.com/@gofortargets/…

Tags: python neural-network gensim word2vec doc2vec

【Solution 1】:

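It looks like an object that should be a text example shaped like a TaggedDocument (with words and tags properties; the class was formerly called LabeledSentence) is somehow a plain string instead. Are you 100% sure the error in the screenshot is generated by the iterable code you've included? (The code here looks like it could only emit acceptable LabeledSentence objects.)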
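One way to track down where a plain string enters the corpus is to check each item before handing the iterable to the model. The following is a minimal sketch, not gensim code: the namedtuple is a stand-in with the same two fields as TaggedDocument, and `first_bad_item` is a hypothetical helper name introduced here for illustration.

```python
from collections import namedtuple

# Stand-in with the same shape as a tagged document (words + tags).
# This is an assumption for illustration; it is not imported from gensim.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def first_bad_item(corpus):
    """Return (index, item) for the first item lacking words/tags, else None."""
    for i, doc in enumerate(corpus):
        if not (hasattr(doc, "words") and hasattr(doc, "tags")):
            return i, doc
    return None


good = [TaggedDocument(["hello", "world"], ["doc_0"])]
bad = good + ["a raw line of text"]  # a plain str slipped into the corpus

# Accessing .words on a plain string raises the same kind of error
# as the one reported in the question.
try:
    "a raw line of text".words
except AttributeError as exc:
    error_message = str(exc)

print(first_bad_item(good))  # no offending item
print(first_bad_item(bad))   # position and value of the stray string
print(error_message)
```

Running such a check over the real iterable should show whether the strings come from the code in the question or from some other code path that passes raw lines to the model.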

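The corpus iterable you supply is read once for an initial scan that discovers all the words/tags, and then several more times for training. The number of training passes is controlled by the iter parameter, whose default (in recent versions of gensim) is 5. So the initial scan plus 5 training passes equals 6 total iterations. (10 or more iterations are common with Doc2Vec.)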
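The pass count can be reproduced without gensim. The sketch below is an illustration under stated assumptions: `CountingCorpus` and `scan_then_train` are hypothetical names, and `scan_then_train` only mimics the access pattern described above (one vocabulary scan, then `epochs` training passes) without doing any real training.

```python
class CountingCorpus:
    """Restartable iterable that counts how many times it is iterated."""

    def __init__(self, docs):
        self.docs = docs
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.docs)


def scan_then_train(corpus, epochs=5):
    """Hypothetical stand-in for the model's access pattern."""
    vocab = set()
    for doc in corpus:        # pass 1: vocabulary scan
        vocab.update(doc)
    for _ in range(epochs):   # passes 2 .. epochs + 1: training
        for doc in corpus:
            pass              # real training would update vectors here
    return vocab


corpus = CountingCorpus([["a", "b"], ["b", "c"]])
vocab = scan_then_train(corpus, epochs=5)
print(corpus.passes)  # 6: one scan plus five training passes
```

This is also why the corpus has to be an object with an `__iter__` method, as in the question, rather than a one-shot generator: a generator would be exhausted after the first pass and the later passes would see no data.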

【Discussion】: