【Posted】:2018-06-28 08:40:54
【Problem description】:
This is my first time building a sentiment-analysis machine-learning model with the nltk NaiveBayesClassifier in Python. I know the model is overly simple, but it's just a first step for me; next time I'll try tokenized sentences.
The real problem with my current model is this: I have explicitly labelled the word 'bad' as negative in the training data (as you can see from the 'negative_vocab' variable). Yet when I run the NaiveBayesClassifier on each (lower-cased) sentence in the list ['awesome movie', 'i like it', 'it is so bad'], the classifier wrongly labels 'it is so bad' as positive.
Input:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')
def word_feat(word):
    return dict([(word, True)])
# NOTE: the 'word_feat(word)' function here is different from the 'word_feats(words)' function I defined earlier. This one is used to iterate over each of the three elements in the list ['awesome movie', ' i like it', ' it is so bad'].
for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()
Output:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence rather than over each word or letter, I wrote some diagnostic code to see what each element passed into 'word_feat(word)' looks like:
for word in words:
    print(word_feat(word))
which prints:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So it looks like the function 'word_feat(word)' is correct?
Does anyone know why the classifier labels 'it is so bad' as positive? As mentioned, I explicitly labelled the word 'bad' as negative in my training data.
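As a side note, here is a minimal sanity check of my own (reusing the training set above). The whole sentence 'it is so bad' passed as a single feature key never appeared in training, so it carries no signal at all; passed as word-level features, 'bad' does behave as a negative cue, although a tokenized sentence can still be outvoted by the neutral words it contains:

```python
from nltk.classify import NaiveBayesClassifier

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know',
                 'words', 'not', 'it', 'so', 'really']

def word_feats(words):
    # One boolean feature per word, matching the training representation.
    return dict([(word, True) for word in words])

train_set = ([(word_feats(negative_vocab), 'neg')]
             + [(word_feats(positive_vocab), 'pos')]
             + [(word_feats(neutral_vocab), 'neu')])
classifier = NaiveBayesClassifier.train(train_set)

# The whole sentence as one feature key was never seen in training,
# so it is ignored and the label falls back to the (uniform) priors.
print(classifier.classify({'it is so bad': True}))

# A word that did occur in training behaves as expected.
print(classifier.classify({'bad': True}))      # -> neg

# Even tokenized, one 'bad' is outvoted by the three neutral words.
print(classifier.classify(word_feats('it is so bad'.split())))  # -> neu
```

So the mislabel comes from the feature representation at prediction time, not from the training labels themselves.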
【Comments】:
-
Could you try a neutral word and see whether the output comes out neutral or positive?
-
For example, in "breaking bad is really a good drama", should bad -> neutral?
-
This is a statistical model; many things can produce output you may not want without the model being wrong, e.g. preprocessing, data bias, backoff strategies, etc.
-
You can't expect a machine-learning model to classify every instance correctly. You need to compute some metrics (e.g. accuracy, a confusion matrix) to evaluate its performance. Once you have those, you can analyse the misclassified points and see whether performance can be improved by, for example, introducing more features.
-
Is there a copy-paste error in your listing? word_feats, positive_vocab, negative_vocab and neutral_vocab are all defined twice.
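Following up on the suggestion to compute metrics: a rough sketch of how accuracy and a confusion matrix could be produced with nltk's own utilities. The labelled test sentences below are made-up placeholders, not data from the question:

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.metrics import ConfusionMatrix

def word_feats(words):
    return dict([(word, True) for word in words])

train_set = [(word_feats(['bad', 'terrible', 'useless', 'hate', ':(']), 'neg'),
             (word_feats(['awesome', 'good', 'nice', 'great', ':)']), 'pos')]
classifier = NaiveBayesClassifier.train(train_set)

# Hypothetical held-out sentences with gold labels.
test_sentences = [('awesome movie', 'pos'),
                  ('it is so bad', 'neg'),
                  ('the sound was really good', 'pos'),
                  ('i hate it', 'neg')]
test_set = [(word_feats(s.split()), label) for s, label in test_sentences]

print('accuracy:', accuracy(classifier, test_set))   # -> 1.0 on this toy set

gold = [label for _, label in test_set]
pred = [classifier.classify(feats) for feats, _ in test_set]
print(ConfusionMatrix(gold, pred))
```

On a real dataset the accuracy will of course be below 1.0, and the confusion matrix shows which label pairs are being confused.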
Tags: nlp classification nltk sentiment-analysis naivebayes