@275365's walkthrough of the data structure behind NLTK's Bayesian classifier is great. From a higher level, we can look at it like this: we have input sentences with sentiment tags:
training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]
Let's treat the feature set as individual words, so we extract from the training data a list of all possible words (let's call it the vocabulary):
from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
Essentially, vocabulary here is the same as @275365's all_words:
>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print(vocabulary == all_words)
True
From each data point (i.e. each sentence and its pos/neg tag), we want to know whether a feature (i.e. a word from the vocabulary) exists:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print({i:True for i in vocabulary if i in sentence})
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}
But we also want to tell the classifier which words are absent from the sentence but present in the vocabulary, so for each data point we list every possible word in the vocabulary and say whether it exists or not:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:True for i in vocabulary if i in sentence}
>>> y = {i:False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
But since that loops through the vocabulary twice, it is more efficient to do this:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:(i in sentence) for i in vocabulary}
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
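You can verify that this single pass builds exactly the same dictionary as the two-pass version above (using a throwaway two_pass dict just for the comparison):
>>> two_pass = {i:True for i in vocabulary if i in sentence}
>>> two_pass.update({i:False for i in vocabulary if i not in sentence})
>>> x == two_pass
True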
So for each sentence, we want to tell the classifier which words exist and which words don't, and also give it the pos/neg tag. That is what we can call the feature_set: each entry is a tuple made up of an x (as shown above) and its tag.
>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary}, tag) for sentence, tag in training_data]
>>> feature_set
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]
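A quick sanity check, just to confirm the shape of the data: there is one (feature dictionary, tag) pair per training sentence:
>>> len(feature_set)
10
>>> feature_set[0][1]
'pos'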
Then we feed these features and tags from the feature_set into the classifier to train it:
from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)
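As a side note, if you want to peek at what the trained classifier learned, NLTK's NaiveBayesClassifier has a show_most_informative_features() method (its output is omitted here since it depends on this toy training data):
classifier.show_most_informative_features(5)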
Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of its words are in the vocabulary the classifier was trained on:
>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
NOTE: as you can see from the step above, the naive bayes classifier cannot handle out-of-vocabulary words, since the foobar token disappears after you featurize the sentence.
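You can confirm this directly: the featurized dictionary has exactly the vocabulary words as its keys, and the unseen foobar token is simply not among them:
>>> set(featurized_test_sentence) == vocabulary
True
>>> 'foobar' in featurized_test_sentence
False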
Then you feed the featurized test sentence into the classifier and ask it to classify:
>>> classifier.classify(featurized_test_sentence)
'pos'
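If you want the label probabilities rather than just the hard label, NaiveBayesClassifier also provides prob_classify(), which returns a probability distribution over the labels:
>>> dist = classifier.prob_classify(featurized_test_sentence)
>>> dist.max()
'pos'
>>> dist.prob('pos')   # a float between 0 and 1; the exact value depends on this toy data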
Hopefully this gives a clearer picture of how to feed data into NLTK's naive bayes classifier for sentiment analysis. Here is the full code without the comments and the walkthrough:
from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain
training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
classifier = nbc.train(feature_set)
test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
print "test_sent:",test_sentence
print "tag:",classifier.classify(featurized_test_sentence)