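【Posted】: 2018-12-21 22:16:38
【Question】:
I have been following SentDex's video series on NLTK and Python and have built a script that uses various models, such as logistic regression, to determine review sentiment. My concern is that SentDex's approach includes the test set when choosing the words used for training, which is clearly undesirable (the train/test split happens after feature selection).
(Edited in response to Mohammed Kashif's comments)
Full code: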
import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
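To make the concern concrete, here is a minimal sketch of the leak using a toy corpus instead of movie_reviews. The four documents, the split point, and the helper name `vocabulary` are all made up for illustration, and `collections.Counter` stands in for `nltk.FreqDist` so the snippet runs without any downloads. It only shows that a vocabulary built before the split can contain words that occur solely in held-out documents.

```python
from collections import Counter

# Toy corpus (hypothetical data, not movie_reviews): (tokens, label) pairs.
documents = [
    (["great", "fun", "film"], "pos"),
    (["boring", "slow", "film"], "neg"),
    (["fun", "clever", "plot"], "pos"),
    (["dreadful", "slow", "plot"], "neg"),
]

# Hypothetical split: first three documents for training, the last for testing.
train_docs = documents[:3]
test_docs = documents[3:]

def vocabulary(docs):
    """Return the set of lower-cased words that appear in docs."""
    counts = Counter(w.lower() for tokens, _ in docs for w in tokens)
    return set(counts)

vocab_all = vocabulary(documents)     # built before the split (what the script above does)
vocab_train = vocabulary(train_docs)  # built from the training documents only

# Words that can only have come from the held-out document.
leaked = vocab_all - vocab_train
print(sorted(leaked))
```

Any word in `leaked` becomes a feature column that the model could not have known about from the training data alone, which is the situation I want to avoid.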
What I have already tried:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for w in documents.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)
training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
This produces the error:
Traceback (most recent call last):
  File "", line 34, in
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X,fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'
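One possible reading of the traceback: the vectorizer inside SklearnClassifier expects each item to be a feature dict, but in the attempt above `testing_set` still holds raw word lists, because only the training documents were passed through `find_features`. Below is a rough sketch of a split-first pipeline under that assumption. It uses a made-up toy corpus and `collections.Counter` in place of `nltk.FreqDist` so it runs on its own, and `most_common(3000)` is a substitution of mine for `list(all_words.keys())[:3000]`, not something taken from the video series.

```python
import random
from collections import Counter

# Toy corpus (hypothetical data standing in for movie_reviews).
documents = [
    (["great", "fun", "film"], "pos"),
    (["boring", "slow", "film"], "neg"),
    (["fun", "clever", "plot"], "pos"),
    (["dreadful", "slow", "plot"], "neg"),
]

# 1. Shuffle and split the raw documents first (seed fixed only for repeatability).
random.Random(0).shuffle(documents)
train_docs, test_docs = documents[:3], documents[3:]

# 2. Build the vocabulary from the training documents only.
counts = Counter(w.lower() for tokens, _ in train_docs for w in tokens)
word_features = [w for w, _ in counts.most_common(3000)]

def find_features(tokens):
    """Map one document's tokens to a {word: bool} feature dict."""
    words = set(tokens)
    return {w: (w in words) for w in word_features}

# 3. Convert BOTH splits to feature dicts using the same training vocabulary.
training_set = [(find_features(tokens), label) for tokens, label in train_docs]
testing_set = [(find_features(tokens), label) for tokens, label in test_docs]
```

With real data, `training_set` and `testing_set` built this way would then be passed to `MNB_classifier.train(...)` and `nltk.classify.accuracy(...)` exactly as in the scripts above; I have not verified the resulting accuracy.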
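【Comments】:
-
Updated the post to include the full code and the traceback for the attempted solution. If you run nltk.download('all'), you should be able to run the code as-is. Also included a link to the video series.
Tags: python machine-learning scikit-learn nlp nltk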