【Posted】: 2021-04-14 13:46:58
【Question】:
I am running sentiment analysis on the data from:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
The dataset contains 25K training and 25K test reviews (12.5K positive and 12.5K negative in each). I keep getting:
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Code:
(The required libraries and variables are initialized separately.)
Creating the training and test data:
import glob
import os
import numpy as np

def load_texts_labels_from_folders(path, folders):
    texts, labels = [], []
    for idx, label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            # use a context manager so each file handle is closed after reading
            with open(fname, 'r', encoding="utf8") as f:
                texts.append(f.read())
            labels.append(idx)
    # stored as np.int8 to save space
    return texts, np.array(labels).astype(np.int8)

# PATH and names are initialized elsewhere (not shown)
trn, trn_y = load_texts_labels_from_folders(f'{PATH}train', names)
val, val_y = load_texts_labels_from_folders(f'{PATH}test', names)
len(trn), len(trn_y), len(val), len(val_y)
len(trn_y[trn_y==1]), len(val_y[val_y==1])
np.unique(trn_y)
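The loader depends on `PATH` and `names`, neither of which appears in the post. As a rough sketch of how it behaves, the example below assumes `names = ['neg', 'pos']` (the label folder names used by aclImdb) and builds a throwaway directory with one hypothetical file per label in place of the real dataset. The file name `0_1.txt` and the review texts are made up for illustration.

```python
import glob
import os
import tempfile

import numpy as np


def load_texts_labels_from_folders(path, folders):
    texts, labels = [], []
    for idx, label in enumerate(folders):
        # sorted() only makes the order deterministic for this demo
        for fname in sorted(glob.glob(os.path.join(path, label, '*.*'))):
            with open(fname, 'r', encoding='utf8') as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, np.array(labels).astype(np.int8)


names = ['neg', 'pos']  # assumption: label folders, index 0 = neg, 1 = pos

with tempfile.TemporaryDirectory() as root:
    # build a tiny stand-in for aclImdb/train with one review per label
    for label, text in [('neg', 'dull'), ('pos', 'great')]:
        folder = os.path.join(root, 'train', label)
        os.makedirs(folder)
        with open(os.path.join(folder, '0_1.txt'), 'w', encoding='utf8') as f:
            f.write(text)

    demo_texts, demo_y = load_texts_labels_from_folders(
        os.path.join(root, 'train'), names)

print(demo_texts, demo_y)
```

Under that assumption, each label is simply the index of its folder in `names`, so `trn_y == 1` selects the positive reviews.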
Count vectorization:
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

# put spaces around punctuation so each mark becomes its own token
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# create term-document matrix
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
# veczr is re-created here and is not fitted again afterwards
veczr = CountVectorizer(tokenizer=tokenize, ngram_range=(1,3), min_df=1, max_features=80000)
trn_term_doc
trn_term_doc[5]  # 83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]
This is where I get the error:
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
【Discussion】:
Tags: python scikit-learn nlp sentiment-analysis countvectorizer