【Question title】: NotFittedError: CountVectorizer - Vocabulary wasn't fitted. while performing sentiment analysis
【Posted】: 2021-04-14 13:46:58
【Question description】:

While performing sentiment analysis on the data at:

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

The dataset contains 25K training reviews and 25K test reviews (12.5K positive and 12.5K negative in each split). I keep getting:

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

Code:

(The required libraries and variable names are initialized separately.)

Creating the training and test data:

import glob
import os
import numpy as np
def load_texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    for idx,label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, 'r',encoding="utf8").read())
            labels.append(idx)
    # stored as np.int8 to save space 
    return texts, np.array(labels).astype(np.int8)

trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)

len(trn),len(trn_y),len(val),len(val_y)

len(trn_y[trn_y==1]),len(val_y[val_y==1])

np.unique(trn_y)

Count vectorization:

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# create the term-document matrix
veczr = CountVectorizer(tokenizer=tokenize)


trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), min_df=1,max_features=80000)
trn_term_doc
trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]

This is where I get the error:

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

【Discussion】:

    Tags: python scikit-learn nlp sentiment-analysis countvectorizer


    【Solution 1】:
    vocab = loaded_vectorizer.get_feature_names()
    

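    loaded_vectorizer is not defined anywhere in this code, so it is no surprise that it has not been initialized (fitted). The vectorizer that was actually fitted here is veczr, via veczr.fit_transform(trn).

    Also, why do you initialize veczr twice? You never fit or use the second one, and reassigning veczr discards the fitted instance.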

    【Comments】:
