【Question Title】: How to save a classifier in sklearn with CountVectorizer() and TfidfTransformer()
【Posted】: 2020-01-21 00:48:08
【Question Description】:

Below is some code for a classifier. I used pickle to save and load the classifier as shown on this page. However, when I load it to use it, I cannot use CountVectorizer() and TfidfTransformer() to turn raw text into the vectors the classifier expects.

The only way I was able to make it work is to analyze the text immediately after training the classifier, as shown below.

import os
import sklearn
from sklearn.datasets import load_files

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

from sklearn.feature_extraction.text import CountVectorizer
import nltk

import pandas
import pickle

class Classifier:

    def __init__(self):

        self.moviedir = os.getcwd() + '/txt_sentoken'

    def Training(self):

        # loading all files. 
        self.movie = load_files(self.moviedir, shuffle=True)


        # Split data into training and test sets
        docs_train, docs_test, y_train, y_test = train_test_split(self.movie.data, self.movie.target, 
                                                                  test_size = 0.20, random_state = 12)

        # initialize CountVectorizer
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)

        # fit and transform using training text 
        docs_train_counts = self.movieVzer.fit_transform(docs_train)


        # Convert raw frequency counts into TF-IDF values
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)

        # Using the fitted vectorizer and transformer, transform the test data
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)

        # Now ready to build a classifier. 
        # We will use Multinomial Naive Bayes as our model


        # Train a Multinomial Naive Bayes classifier. Again, we call it "fitting"
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)


        # save the model
        filename = 'finalized_model.pkl'
        pickle.dump(self.clf, open(filename, 'wb'))

        # Predict the Test set results, find accuracy
        y_pred = self.clf.predict(docs_test_tfidf)

        # Accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))

        self.Categorize()

    def Categorize(self):
        # very short and fake movie reviews
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good', 
                      'This was certainly a movie', 'I fell asleep halfway through', 
                      "We can't wait for the sequel!!", 'I cannot recommend this highly enough', 'What the hell is this shit?']

        reviews_new_counts = self.movieVzer.transform(reviews_new)         # turn text into count vector
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)  # turn into tfidf vector


        # have classifier make a prediction
        pred = self.clf.predict(reviews_new_tfidf)

        # print out results
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))

【Question Discussion】:

    Tags: python-3.x scikit-learn


    【Solution 1】:

    Following MaximeKan's suggestion, I worked out a way to save all 3.

    Save the model and vectorizers

    import pickle
    
    with open(filename, 'wb') as fout:
        pickle.dump((movieVzer, movieTfmer, clf), fout)
    

    Load the model and vectorizers for use

    import pickle
    
    with open('finalized_model.pkl', 'rb') as f:
        movieVzer, movieTfmer, clf = pickle.load(f)
    
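    For completeness, here is a minimal self-contained sketch of the full save/load round trip. The tiny toy reviews and labels below are stand-ins for the question's movie data, not from the original post:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's movie reviews and labels
docs = ['great movie', 'loved it', 'terrible film', 'awful plot']
labels = [1, 1, 0, 0]

# Fit the vectorizer, transformer, and classifier as in the question
movieVzer = CountVectorizer()
movieTfmer = TfidfTransformer()
clf = MultinomialNB()
counts = movieVzer.fit_transform(docs)
tfidf = movieTfmer.fit_transform(counts)
clf.fit(tfidf, labels)

# Save all three objects in one pickle file
with open('finalized_model.pkl', 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)

# Load them back; the fitted vocabulary and IDF weights come along
with open('finalized_model.pkl', 'rb') as fin:
    movieVzer, movieTfmer, clf = pickle.load(fin)

# Raw text can now be vectorized and classified without retraining
new_counts = movieVzer.transform(['loved it'])
new_tfidf = movieTfmer.transform(new_counts)
pred = clf.predict(new_tfidf)
```

    Because the loaded vectorizer keeps its fitted vocabulary, the transformed vectors have exactly the dimensions the classifier was trained on.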

    【Discussion】:

      【Solution 2】:

      This happens because you should save not only the classifier but also the vectorizers. Otherwise you would be refitting the vectorizers on unseen data, which obviously will not contain exactly the same words as the training data, so the dimensions would change. This is a problem because your classifier expects its input in a specific format.

      The solution to your problem is therefore quite simple: you should also save the vectorizers as pickle files and load them together with the classifier before using them.

      Note: to avoid saving and loading two extra objects, you could consider putting them inside a pipeline, which is equivalent.
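      A minimal sketch of the pipeline approach (the toy reviews, labels, and step names here are illustrative, not from the original post):

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's movie reviews and labels
docs_train = ['great movie', 'loved it', 'terrible film', 'awful plot']
y_train = [1, 1, 0, 0]

# Vectorizer, transformer, and classifier chained into one estimator
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(docs_train, y_train)

# A single pickle now captures the whole preprocessing + model chain
with open('movie_pipeline.pkl', 'wb') as fout:
    pickle.dump(pipe, fout)

with open('movie_pipeline.pkl', 'rb') as fin:
    loaded = pickle.load(fin)

# Raw text goes straight into predict(); no separate vectorizer handling
pred = loaded.predict(['loved it'])
```

      With the pipeline, fitting, saving, loading, and predicting each touch a single object, so there is no way to accidentally lose the fitted vocabulary.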

      【Discussion】:
