Sklearn NotFittedError for CountVectorizer in pipeline

【Question title】: Sklearn NotFittedError for CountVectorizer in pipeline
【Posted】: 2019-01-17 05:58:13
【Question】:

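I am trying to learn how to work with text data in sklearn, and I have run into a problem I cannot solve.

The tutorial I am following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

The input is a pandas DataFrame with two columns: one holds the text, the other a binary class.

Code: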

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])

x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']

# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)


# TF-IDF
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)

# MNB

mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)

text_clf = Pipeline([('vect', CountVectorizer()),
             ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
                ])

predicted = text_clf.predict(x_test_modified)

When I try to run the last line:

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    113 
    114         # lambda, but not partial, allows help() to work with update_wrapper
--> 115         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    116         # update the docstring of the returned function
    117         update_wrapper(out, self.fn)

~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
    304         for name, transform in self.steps[:-1]:
    305             if transform is not None:
--> 306                 Xt = transform.transform(Xt)
    307         return self.steps[-1][-1].predict(Xt)
    308 

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
    918             self._validate_vocabulary()
    919 
--> 920         self._check_vocabulary()
    921 
    922         # use the same matrix-building strategy as fit_transform

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
    301         """Check if vocabulary is empty or missing (not fit-ed)"""
    302         msg = "%(name)s - Vocabulary wasn't fitted."
--> 303         check_is_fitted(self, 'vocabulary_', msg=msg),
    304 
    305         if len(self.vocabulary_) == 0:

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
    766 
    767     if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768         raise NotFittedError(msg % {'name': type(estimator).__name__})
    769 
    770 

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

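Any suggestions on how to fix this error? I am transforming the test data with the fitted CV model correctly. I even checked whether the vocabulary is empty, and it is not (count_vect.vocabulary_).

Thanks!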

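【Question comments】:

Tags: machine-learning scikit-learn nlp tf-idf countvectorizer

【Solution 1】:

There are several issues with your question.

To begin with, you never actually fit the pipeline, hence the error. The CountVectorizer() inside text_clf is a new, unfitted object, separate from the count_vect you fitted earlier, which is why checking count_vect.vocabulary_ did not help. Look closely at the linked tutorial and you will see a text_clf.fit step (where text_clf is indeed the pipeline).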
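As a minimal sketch of this first point (the toy texts and labels below are made up for illustration and are not from the question), calling predict on a pipeline that was never fitted raises NotFittedError, while the same call succeeds once fit has been run:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the text and class columns.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# The pipeline owns a fresh CountVectorizer; fitting some other
# vectorizer object elsewhere does not give this one a vocabulary.
try:
    text_clf.predict(["good film"])
    raised = False
except NotFittedError:
    raised = True

# Fitting the pipeline fits every step in order, after which predict works.
text_clf.fit(texts, labels)
predicted = text_clf.predict(["good film"])
```

Note that predict receives raw strings here, not an already vectorized matrix: the pipeline runs its own vectorizer on whatever it is given.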

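Second, you are not using the pipeline as intended, which is precisely to fit the whole thing end to end; instead, you fit its individual components one by one. If you look at the tutorial again, you will see that the code that fits the pipeline: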

text_clf.fit(twenty_train.data, twenty_train.target)

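uses the data in its initial form, not its intermediate transformations as you did; the point of the tutorial is to show how the individual transformations can be wrapped up in (and replaced by) a pipeline, not to use a pipeline on top of those transformations.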
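To make the "replaced by" part concrete, here is a rough sketch, on made-up toy data, that runs the three steps by hand and then through a pipeline fed the raw text. The variable names are illustrative; the only claim is that both routes give the same predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from the question.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great movie", "awful movie"]

# Route 1: fit and apply each step by hand. Every transformation fitted on
# the training data must also be applied to the test data, in the same order.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
x_train_counts = count_vect.fit_transform(train_texts)
x_train_tfidf = tfidf.fit_transform(x_train_counts)
clf.fit(x_train_tfidf, train_labels)
manual_pred = clf.predict(tfidf.transform(count_vect.transform(test_texts)))

# Route 2: the pipeline does the same chaining internally, starting from raw text.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)
pipeline_pred = text_clf.predict(test_texts)

same = np.array_equal(manual_pred, pipeline_pred)
```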

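Third, avoid naming a variable fit: it is not a reserved word in Python, but fit is the name of the method every scikit-learn estimator exposes, so a variable with that name is confusing to read. Similarly, we do not abbreviate Count Vectorizer as CV (in ML terminology, CV stands for cross-validation).

That said, here is the correct way to use the pipeline: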

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])

x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB()),
                     ])

text_clf.fit(x_train, y_train) 

predicted = text_clf.predict(x_test)

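As you can see, the purpose of a pipeline is to make things simpler (compared with using the components one after another), not to complicate them further.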
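The code above stops at predicted and never uses y_test. One possible way to close the loop, sketched here on a made-up toy corpus (the texts, labels, test_size and random_state are all assumptions for illustration), is to compare the predictions with the held-out labels:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for nlp_df['text'] and nlp_df['class'].
texts = [
    "great plot and superb acting",
    "wonderful film with a great cast",
    "superb direction and wonderful music",
    "great music and a wonderful plot",
    "terrible plot and awful acting",
    "boring film with a terrible cast",
    "awful direction and boring music",
    "terrible music and a boring plot",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratified split so both classes appear in the training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Fit on raw training text, predict on raw test text, then score.
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
accuracy = accuracy_score(y_test, predicted)
print(f"accuracy on {len(y_test)} held-out texts: {accuracy:.2f}")
```

With a corpus this small the score itself means little; the point is only that raw text goes in at both fit and predict time.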

【Comments】:

Thanks for the explanation. I definitely need to read the Pipeline docs first!