NLTK Naive Bayes Classifier Training Issues: Answers

【Question title】: NLTK Naive Bayes Classifier Training issues
【Posted】: 2017-08-30 04:58:39
【Question description】:

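I'm trying to train a classifier for tweets. The problem is that it reports 100% accuracy, and the list of most informative features shows nothing. Does anyone know what I'm doing wrong? I believe all my inputs to the classifier are correct, so I can't tell where it goes wrong.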

This is the dataset I'm using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

Here is my code:

import nltk
import random

file = open('Train/train.txt', 'r')


documents = []
all_words = []           #TODO remove punctuation?
INPUT_TWEETS = 3000

print("Preprocessing...")
for line in (file):

    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])

    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]   #top 3000 words

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#Categorize as positive or Negative
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]


training_set = feature_set[:1000]
testing_set = feature_set[1000:]  

print("Training...")
classifier = nltk.NaiveBayesClassifier.train(training_set)

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)

【Question comments】:

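It looks like the problem is that the character at line[0] is compared with the int 0. I doubt your input actually uses a null byte to represent negative sentiment.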
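To make the comment above concrete: in Python 3 a one-character string never compares equal to an integer, so line[0] == 0 is always False and every tweet lands in the else branch. A minimal sketch, assuming each line starts with the label character 0 or 1 followed by a separator; the sample lines and the helper name label_for are made up for illustration:

```python
def label_for(line):
    """Map the leading label character of a line to a sentiment name."""
    # Compare against the string "0", not the integer 0.
    return "negative" if line[0] == "0" else "positive"


# Hypothetical sample lines in the "<label>,<text>" shape the question's code assumes.
sample_lines = ["0,so sad today", "1,what a great day"]

# The original comparison is False for every line, so nothing is ever "negative".
buggy = ["negative" if line[0] == 0 else "positive" for line in sample_lines]
fixed = [label_for(line) for line in sample_lines]

print(buggy)  # ['positive', 'positive']
print(fixed)  # ['negative', 'positive']
```

If every document ends up with the same label, a classifier that always predicts that label scores 100% on the test set, and no feature can distinguish between classes. That would be consistent with the output described in the question.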

Tags: python nltk sentiment-analysis naivebayes nltk-trainer

【Solution 1】:

There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

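The loop variable is spelled sentment, so sentiment always has the same value (namely that of the last tweet from the preprocessing step). Training is therefore pointless and none of the features carry any information.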
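The effect can be reproduced in isolation. A minimal sketch with toy data (the documents, labels, and feature words below are invented, and the stale outer variable is set by hand to mimic what the preprocessing loop leaves behind):

```python
# Toy stand-ins for the question's data; the tweets and labels are made up.
documents = [
    (["good", "day"], "positive"),
    (["bad", "day"], "negative"),
    (["sad", "news"], "negative"),
]
word_features = ["good", "bad", "sad"]


def find_features(document):
    """Return a dict marking which feature words occur in the document."""
    words = set(document)
    return {w: (w in words) for w in word_features}


# Simulates the variable left over from the preprocessing loop.
sentiment = "negative"

# Typo: the loop variable is `sentment`, so `sentiment` is the stale outer value.
buggy = [(find_features(words), sentiment) for (words, sentment) in documents]

# Corrected: the label comes from the tuple being unpacked.
fixed = [(find_features(words), label) for (words, label) in documents]

print([label for _, label in buggy])  # ['negative', 'negative', 'negative']
print([label for _, label in fixed])  # ['positive', 'negative', 'negative']
```

Using distinct names for the loop variables (words, label) rather than reusing all_words also avoids shadowing the frequency distribution defined earlier in the script.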

Fix it and you get:

('Naive Bayes Accuracy:', 66.75)
Most Informative Features
                  -- = True           positi : negati =      6.9 : 1.0
               these = True           positi : negati =      5.6 : 1.0
                face = True           positi : negati =      5.6 : 1.0
                 saw = True           positi : negati =      5.6 : 1.0
                   ] = True           positi : negati =      4.4 : 1.0
               later = True           positi : negati =      4.4 : 1.0
                love = True           positi : negati =      4.1 : 1.0
                  ta = True           positi : negati =      4.0 : 1.0
               quite = True           positi : negati =      4.0 : 1.0
              trying = True           positi : negati =      4.0 : 1.0
               small = True           positi : negati =      4.0 : 1.0
                 thx = True           positi : negati =      4.0 : 1.0
               music = True           positi : negati =      4.0 : 1.0
                   p = True           positi : negati =      4.0 : 1.0
             husband = True           positi : negati =      4.0 : 1.0

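【Comments】:

I fixed the typo, but my output didn't change: it still reports 100% and shows no features.
Then your train.txt may be corrupted or incomplete? I used df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8') to read the raw data into a DataFrame and iterated over the rows with df.iterrows() to get the output pasted above.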
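The loading step from the last comment can also be approximated without pandas. A minimal sketch using only the standard library, assuming the CSV has a header row with Sentiment and SentimentText columns (an assumption about the dataset's layout, so check your copy) and that malformed rows should simply be skipped, which is roughly what error_bad_lines=False asks pandas to do. The inline sample data and the helper name load_documents are invented for illustration:

```python
import csv
import io


def load_documents(handle, limit=None):
    """Read (tokens, label) pairs from a CSV file object, skipping malformed rows."""
    documents = []
    for row in csv.DictReader(handle):
        label = row.get("Sentiment")
        text = row.get("SentimentText")
        # Skip rows with a missing or unexpected label, or with no text.
        if label not in ("0", "1") or not text:
            continue
        sentiment = "negative" if label == "0" else "positive"
        # Whitespace split keeps the sketch dependency-free;
        # the question's code uses nltk.word_tokenize instead.
        documents.append((text.lower().split(), sentiment))
        if limit is not None and len(documents) >= limit:
            break
    return documents


# Made-up sample standing in for the real file; the last row is malformed.
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,so sad today\n"
    "2,1,Sentiment140,what a great day\n"
    "3,oops\n"
)

docs = load_documents(sample)
print(docs)
```

For the real file, the handle would come from something like open('Sentiment Analysis Dataset.csv', encoding='utf-8', newline=''). If you stay with pandas, note that recent releases replaced error_bad_lines=False with on_bad_lines='skip'.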