【Posted】: 2016-08-08 13:51:17
【Question】:
I am using scikit-learn together with nltk to check the accuracy of a Naive Bayes classifier. What am I doing wrong?
...readFile definition not needed
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]
#apply bag of words to extract features
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))
#apply term-frequency (tf) weighting; use_idf=False turns off the idf part
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)
predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)
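The problem is that printing nltk.classify.accuracy takes a very long time. I suspect I am doing something wrong, but since no error is raised, I cannot figure out what it is.
【Comments】: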
-
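Are you sure it even reaches the accuracy call? What are you trying to predict? Try adding some print statements to see where it stalls. The way you fit your classifier looks odd: it should be
clf.fit(X, Y), where X is the (vectorized) text and Y holds the labels of the training set.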
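To illustrate the clf.fit(X, Y) point, here is a minimal sketch, not the asker's actual pipeline. It assumes the data can be arranged as (text, label) pairs; the toy `labeled_docs` list, the "pos"/"neg" labels, and the split index are all made up for the example, since the question does not show what readFile returns. Accuracy is computed with scikit-learn's own `accuracy_score` because `nltk.classify.accuracy` expects an NLTK classifier and a list of (featureset, label) pairs, not a scikit-learn estimator and an array of predictions. The sketch is written for Python 3.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the output of readFile:
# each item is a (text, label) pair.
labeled_docs = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("loved the wonderful cast", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful acting boring plot", "neg"),
    ("hated the boring cast", "neg"),
    ("great plot wonderful movie", "pos"),
    ("awful boring terrible movie", "neg"),
]

# Split into training and testing sets (the split index is arbitrary here).
split = 6
train_texts = [text for text, _ in labeled_docs[:split]]
train_labels = [label for _, label in labeled_docs[:split]]
test_texts = [text for text, _ in labeled_docs[split:]]
test_labels = [label for _, label in labeled_docs[split:]]

# Bag of words: fit the vocabulary on the training texts only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_texts)

# Term-frequency weighting without idf, as in the question.
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# X is the vectorized text, Y is the list of labels (not the raw text).
clf = MultinomialNB().fit(X_train_tf, train_labels)
print("training done")

# Reuse the fitted vectorizer and transformer on the test texts.
X_test_tf = tf_transformer.transform(count_vect.transform(test_texts))
predicted = clf.predict(X_test_tf)
print("prediction done")

# Compare predictions against the true test labels.
accuracy = accuracy_score(test_labels, predicted)
print("%.3f" % accuracy)
```

The print calls between stages follow the suggestion to check where the script stalls; on a dataset of millions of lines, they show whether the time goes into vectorizing, fitting, or predicting.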
Tags: python-2.7 scikit-learn nltk