Classify text using NaiveBayesClassifier (answers)

【Question title】: Classify text using NaiveBayesClassifier
【Posted】: 2018-11-07 18:01:52
【Question description】:

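I have a text file with one sentence per line, for example: "Have you registered your email ID with your bank account?"

I want to classify it as an interrogative sentence. FYI, these are sentences from a bank's website. I have seen this answer, which uses the following nltk code block: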

import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

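So I did some preprocessing on my text file (stemming, stop-word removal, and so on) so that each sentence becomes a bag of words. The code above gives me a trained classifier. How do I apply it to my text file of sentences (raw or preprocessed)?

Update: here is a sample of my text file.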

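【Question comments】:

You need to transform the documents using (scikit-learn.org/stable/modules/generated/…) and then apply the classifier. Can you upload your data?
@seralouk Thanks for the reply, I'm looking at the link now! I've updated the question with a sample of my data.
Not sure why I was downvoted. Should I provide more information?
@seralouk No, they are all sentence strings. I've posted the pre-processing (raw) version. If you like, I can attach the processed version with numbers removed, stemming applied, and stop words removed?
@seralouk Can't I train the classifier on nps_chat and then classify my sample data with it?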

Tags: python-3.x machine-learning scikit-learn nlp nltk

【Solution 1】:

Assuming you have already preprocessed your document data as we discussed, you can do the following:

import nltk
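nltk.download('nps_chat')
nltk.download('punkt')  # tokenizer models for nltk.word_tokenize (recent NLTK releases use 'punkt_tab')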
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

# Train on train_set only, so the accuracy below is measured on held-out posts
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.668
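
To make the features concrete, here is a minimal sketch of what the extractor produces for a sample banking sentence. It assumes a simple regular expression as a stand-in for nltk.word_tokenize, so that it runs without the tokenizer download; `simple_tokenize` is a hypothetical helper, and NLTK's own tokens can differ in edge cases such as contractions.

```python
import re


def simple_tokenize(text):
    # Assumption: a regex stand-in for nltk.word_tokenize.
    # It separates runs of word characters from runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


def dialogue_act_features(post):
    # Same bag-of-words scheme as above: one True entry per lower-cased token
    features = {}
    for word in simple_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features


sentence = "Have you registered your email ID with your bank account?"
features = dialogue_act_features(sentence)
for name in sorted(features):
    print(name)
```

Because every value is True, word order and repeated words are lost; the classifier only sees which tokens occur. That is a reason to try the raw sentences first: stemming or stop-word removal would alter or drop tokens (common stop words such as "have" and "you", for instance) that the nps_chat training posts still contain.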

For your data, train the classifier once, then iterate over your lines and predict a label for each one:

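# Retrain on all labelled posts, then classify each sentence of your file
classifier = nltk.NaiveBayesClassifier.train(featuresets)
for line in open('your_sentences.txt', encoding='utf-8'):  # placeholder file name
    print(classifier.classify(dialogue_act_features(line)))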
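
A self-contained sketch of the train-then-classify steps follows. To keep it runnable without downloading a corpus, it trains on six invented sentences (an assumption, not the nps_chat data) and tokenizes with NLTK's regex-based wordpunct_tokenize rather than word_tokenize. The label names 'ynQuestion' and 'Statement' are borrowed from the nps_chat classes.

```python
from nltk import NaiveBayesClassifier
from nltk.tokenize import wordpunct_tokenize


def dialogue_act_features(post):
    # One boolean feature per lower-cased token (bag of words)
    return {'contains({})'.format(w.lower()): True
            for w in wordpunct_tokenize(post)}


# Invented toy training data; in practice use the nps_chat featuresets
toy_posts = [
    ("do you have an account?", "ynQuestion"),
    ("is the branch open?", "ynQuestion"),
    ("can i reset my password?", "ynQuestion"),
    ("the branch is open today.", "Statement"),
    ("i have an account.", "Statement"),
    ("my password was reset.", "Statement"),
]
featuresets = [(dialogue_act_features(text), label) for text, label in toy_posts]
classifier = NaiveBayesClassifier.train(featuresets)

# Stand-in for the lines of your text file
lines = ["is the card active?", "the card is active."]
labels = [classifier.classify(dialogue_act_features(line)) for line in lines]
for line, label in zip(lines, labels):
    print(label, '|', line)
```

In this toy set the question mark and the full stop are the only tokens that separate the two classes, so the prediction rests almost entirely on punctuation. With the real corpus, the question-like labels to look for are 'whQuestion' and 'ynQuestion'.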

【Comments】:

【Solution 2】:

Do this for every line in your text file:

classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(dialogue_act_features(line)))
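
One way to apply this to a whole file is sketched below. `label_lines` is a hypothetical helper that accepts any `classify` and `featurize` callables, so `classifier.classify` and `dialogue_act_features` from the code above can be passed in directly. The demo uses trivial stand-ins and a temporary file only so that it runs on its own.

```python
import os
import tempfile


def label_lines(path, classify, featurize):
    # Yield (label, sentence) for every non-empty line of the file
    with open(path, encoding='utf-8') as handle:
        for raw in handle:
            sentence = raw.strip()
            if sentence:
                yield classify(featurize(sentence)), sentence


# Stand-ins so the sketch runs alone; replace them with
# dialogue_act_features and classifier.classify.
def featurize(sentence):
    return {'contains({})'.format(w.lower()): True for w in sentence.split()}


def classify(features):
    return 'ynQuestion' if 'contains(?)' in features else 'Statement'


# Write a small sample file, including a blank line that should be skipped
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Is my card active ?\n\nYour card is active .\n")

results = list(label_lines(tmp.name, classify, featurize))
os.remove(tmp.name)

# Keep only the sentences labelled as questions
questions = [s for label, s in results if label in ('whQuestion', 'ynQuestion')]
print(questions)
```

Stripping each line before featurizing avoids passing trailing newlines to the tokenizer, and skipping blank lines avoids classifying empty feature sets.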

【Comments】: