Python Naive Bayes classifier trained on the movie review corpus to test on tweets

【Question title】: Python Naive Bayes Classifier trained on Movie Review Corpus to test on Tweets
【Posted】: 2016-03-11 12:10:56
【Question description】:

import nltk.classify.util
import csv
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Integer division keeps the cutoffs usable as slice indices in Python 2 and 3
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

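I am very new to Python and am trying to do sentiment analysis on tweets. I am using the Naive Bayes classifier built into the NLTK package. I am currently training and testing it on the movie review corpus, and I would like to test it on tweets that I collected with Tweepy and stored in a .txt or .csv file. Can anyone help me figure out how to run this classifier on the tweets in my output file? Thanks!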

【Question discussion】:

Tags: python twitter nlp nltk naivebayes

【Solution 1】:

Just load the tweets:

# For a file with one tweet per line
with open('tweets.txt', 'r') as f:
    data = f.readlines()
# word_feats expects the tokens of a single tweet, so build one feature dict per tweet
testfeats = [word_feats(tweet.split()) for tweet in data]

Then you can use your word_feats method to extract the features (you could use scikit-learn's CountVectorizer instead).
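Since the stored tweets carry no sentiment labels, nltk.classify.util.accuracy (which expects (featureset, label) pairs) does not apply to them. A minimal sketch of the labelling step, assuming `classifier` is the trained NaiveBayesClassifier from the question and calling its classify() method once per tweet. The helper name classify_tweets and the lowercasing step are my own assumptions, not part of the original answer:

```python
def word_feats(words):
    # Same bag-of-words features as in the question
    return dict((word, True) for word in words)


def classify_tweets(classifier, tweets):
    """Return a list of (tweet, predicted_label) pairs.

    `classifier` is assumed to be the trained classifier from the question;
    any object with a classify(featureset) method will work here.
    """
    results = []
    for tweet in tweets:
        # Lowercasing is an assumption: it gives tokens a better chance of
        # matching the vocabulary seen during training.
        tokens = tweet.lower().split()
        if not tokens:
            continue  # skip blank lines
        label = classifier.classify(word_feats(tokens))
        results.append((tweet.strip(), label))
    return results


# Usage, with the classifier trained in the question:
# with open('tweets.txt') as f:
#     for tweet, label in classify_tweets(classifier, f.readlines()):
#         print('%s\t%s' % (label, tweet))
```

Keep in mind that a model trained on long movie reviews may transfer poorly to short tweets, so it is worth spot-checking some of the predicted labels by hand.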

【Discussion】:

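Thank you very much for your reply. I added the following: >>> f = open('tweets.txt', 'r') >>> data = f.readlines() >>> testfeats = word_feats(data) but I got a "ValueError: too many values to unpack" error. What do you think is causing this?
You probably need to parse the text file. What is the format?
I have the same data in two different formats, in separate text files. Both contain only the text of the tweets; in one file the tweets are separated by the delimiter ",," and in the other by newlines.
You need to tokenize the tweets. Either use string.split(' ') (see my example) or use a tokenizer such as the ones in NLTK.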
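The two file formats mentioned in this thread and the tokenization step can be sketched as follows. This uses only the standard library. The names split_tweets and tokenize and the regular expression are my own illustrative choices, a deliberately simple stand-in rather than NLTK's tokenizer; nltk.tokenize.TweetTokenizer is a more thorough option if NLTK is already installed.

```python
import re


def split_tweets(text, delimiter=None):
    """Split raw file contents into individual tweets.

    delimiter=None means one tweet per line; pass ',,' for the file
    in which tweets are separated by two commas.
    """
    parts = text.splitlines() if delimiter is None else text.split(delimiter)
    # Drop surrounding whitespace and empty entries
    return [p.strip() for p in parts if p.strip()]


# Hypothetical, simple pattern: words (optionally prefixed by @ or #,
# optionally with an apostrophe part), or a single punctuation character.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?|[^\w\s]")


def tokenize(tweet):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_RE.findall(tweet.lower())


# Usage:
# with open('tweets.txt') as f:
#     tweets = split_tweets(f.read())            # newline-separated file
#     # tweets = split_tweets(f.read(), ',,')    # ',,'-separated file
# testfeats = [word_feats(tokenize(t)) for t in tweets]
```

Note that the ",," format is ambiguous if a tweet itself contains two consecutive commas, so the newline-separated file is probably the safer one to use.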