[Posted on]: 2017-12-21 07:02:40
[Question]:
I wrote a simple document classifier and am currently testing it on the Brown corpus. However, my accuracy is still very low (0.16). I have already excluded stopwords. Any other ideas on how to improve the classifier's performance?
import nltk, random
from nltk.corpus import brown, stopwords

documents = [(list(brown.words(fileid)), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
random.shuffle(documents)

stop = set(stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)
word_features = list(all_words.keys())[:3000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
[Comments]:
-
I think something is off with this version of the code; it looks like two commented-out lines are needed just before classifier = nltk.... Incidentally, that version doesn't use a Naive Bayes classifier but a decision tree classifier, so you should probably change the tags and the title.
-
You are not excluding the stopwords, you are keeping only them. Change:
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop) to all_words = nltk.FreqDist(w.lower() for w in brown.words() if w not in stop)
Tags: python classification nltk naivebayes