Problems fitting vocabulary in scikit-learn? Answers

【Question title】: Problems fitting vocabulary in scikit-learn?
【Posted】: 2015-02-22 06:29:33
【Question description】:

I have a directory full of .txt files (documents). First I load the files and strip out some brackets and quotes, so the files look like this, for example:

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to create a training matrix, as follows:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

This is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

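Given this, how can I create the vector representation? I thought documents held the loaded files, but it seems the vectorizer cannot be fitted on them.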

【Question discussion】:

Tags: python machine-learning nlp scikit-learn

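【Solution 1】:

What does documents contain? It looks like it should be a list of file names or a list of strings containing the tokens. Also, you should call fit_transform on the vectorizer object rather than on the class as if it were a static method, i.e. vectorizer.fit_transform(documents).

For example, this works here: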

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
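
As a quick check of the snippet above, the following sketch calls fit_transform on the instance and inspects the result. The n_features=2**10 setting is an assumption added here only to keep the dense array small; it is not part of the original answer.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One string per document, as in the answer above.
documents = ['this is a test', 'another test']

# n_features is reduced here (an assumption for illustration) so that
# converting to a dense array stays cheap.
vectorizer = HashingVectorizer(analyzer='word', n_features=2**10)

# Call fit_transform on the instance, not on the HashingVectorizer class.
X = vectorizer.fit_transform(documents)

print(X.shape)      # one row per document, one column per hash bucket
print(X.toarray())  # dense view of the sparse matrix
```

HashingVectorizer keeps no vocabulary, so fit_transform here only hashes the tokens of each document into a fixed number of columns.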

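【Comments】:

Thanks for the feedback. When I print documents I get the following: [a very large text][another very large text][a third very large text], three lists representing the 3 .txt files I have in the directory. Any suggestions?
Yes, your documents should be a list in which each element is a string holding one tokenized document. Something like: documents=['word_1_doc_1 word_2_doc_1 word_3_doc_1', 'word_1_doc_2 word_2_doc_2 ...', 'word_1_doc_3, word_2_doc_3 ...']. If you do something like documents=[' '.join(ii) for ii in documents], it might work.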
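
The suggestion in the last comment can be sketched in plain Python. This assumes the loader yields one list of tokens per file; token_lists below is made-up stand-in data, since the thread does not show what load(directory) returns.

```python
# Hypothetical stand-in for the tokens loaded from three .txt files.
token_lists = [
    ["machine", "learning", "explores", "algorithms"],
    ["it", "has", "ties", "to", "optimization"],
    ["such", "algorithms", "build", "a", "model"],
]

# Joining the str() of every list, roughly as in the question, produces
# one long string, so the boundaries between documents are lost.
flattened = "".join(str(tokens) for tokens in token_lists)
print(type(flattened))

# What a vectorizer expects: a list with one string per document.
documents = [" ".join(tokens) for tokens in token_lists]
print(len(documents))
print(documents[0])
```

A list built this way can be passed directly to vectorizer.fit_transform(documents), giving one row per file.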