@275365's walkthrough of the data structure behind NLTK's Bayesian classifier is great. From a higher level, we can look at it like this: we have input sentences with sentiment tags:
training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]
Let's treat the feature set as individual words, so we extract from the training data a list of all possible words (let's call it the vocabulary):
from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
Essentially, vocabulary here is the same as @275365's all_words:
>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print(vocabulary == all_words)
True
From each data point (i.e. each sentence and its pos/neg tag), we want to know whether a feature (i.e. a word from the vocabulary) exists:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print({i:True for i in vocabulary if i in sentence})
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}
But we also want to tell the classifier which words are absent from the sentence but present in the vocabulary, so for each data point we list every possible word in the vocabulary and say whether it exists or not:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:True for i in vocabulary if i in sentence}
>>> y = {i:False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
But since that loops through the vocabulary twice, it is more efficient to do this:
>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:(i in sentence) for i in vocabulary}
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}
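You can verify that this single pass builds exactly the same dictionary as the two-pass version above (using a throwaway two_pass dict just for the comparison):
>>> two_pass = {i:True for i in vocabulary if i in sentence}
>>> two_pass.update({i:False for i in vocabulary if i not in sentence})
>>> x == two_pass
True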
So for each sentence, we want to tell the classifier which words exist and which words don't, and also give it the pos/neg tag. That is what we can call the feature_set: each entry is a tuple made up of an x (as shown above) and its tag.
>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary}, tag) for sentence, tag in training_data]
>>> feature_set
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]
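A quick sanity check, just to confirm the shape of the data: there is one (feature dictionary, tag) pair per training sentence:
>>> len(feature_set)
10
>>> feature_set[0][1]
'pos'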
Then we feed these features and tags from the feature_set into the classifier to train it:
from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)
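As a side note, if you want to peek at what the trained classifier learned, NLTK's NaiveBayesClassifier has a show_most_informative_features() method (its output is omitted here since it depends on this toy training data):
classifier.show_most_informative_features(5)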
Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of its words are in the vocabulary the classifier was trained on:
>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
NOTE: as you can see from the step above, the naive bayes classifier cannot handle out-of-vocabulary words, since the foobar token disappears after you featurize the sentence.
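You can confirm this directly: the featurized dictionary has exactly the vocabulary words as its keys, and the unseen foobar token is simply not among them:
>>> set(featurized_test_sentence) == vocabulary
True
>>> 'foobar' in featurized_test_sentence
False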
Then you feed the featurized test sentence into the classifier and ask it to classify:
>>> classifier.classify(featurized_test_sentence)
'pos'
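If you want the label probabilities rather than just the hard label, NaiveBayesClassifier also provides prob_classify(), which returns a probability distribution over the labels:
>>> dist = classifier.prob_classify(featurized_test_sentence)
>>> dist.max()
'pos'
>>> dist.prob('pos')   # a float between 0 and 1; the exact value depends on this toy data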
Hopefully this gives a clearer picture of how to feed data into NLTK's naive bayes classifier for sentiment analysis. Here is the full code without the comments and the walkthrough:
from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain
training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
classifier = nbc.train(feature_set)
test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
print "test_sent:",test_sentence
print "tag:",classifier.classify(featurized_test_sentence)