NLTK, Naive Bayes: why are some features NONE?

【Question title】: NLTK, Naive Bayes: why are some features NONE?
【Posted】: 2016-08-16 03:26:15
【Question】:

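I am trying to implement Naive Bayes with NLTK.

When I print the most informative features, some of them are assigned the value "NONE". Why is that?

I am using a bag-of-words model: when I output the features, every feature is assigned True.

Where does NONE come from?

I read

The feature value 'None' is reserved for unseen feature values;

here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html

What does that mean?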

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import nltk.data
from nltk.corpus import stopwords
import collections
from nltk.classify.util import accuracy
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import nltk.metrics

def bag_of_words(words):
    return dict([(word, True) for word in words])

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

def bag_of_words_without_stopwords(words):
    badwords = stopwords.words("german")
    return bag_of_words_not_in_set(words, badwords)

def label_feats_from_corpus(corp, feature_detector=bag_of_words_without_stopwords):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

def split_label_feats(lfeats, split=0.75):
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats


reader = CategorizedPlaintextCorpusReader('D:/corpus/', r'.*\.txt', cat_pattern=r'(\w+)/*')

all_words = nltk.FreqDist(w.lower() for w in reader.words())

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

bigrams = bigram_word_feats(reader.words())

lfeats = label_feats_from_corpus(reader)

train_feats, test_feats = split_label_feats(lfeats, split=0.75)
len(train_feats)
nb_classifier = NaiveBayesClassifier.train(train_feats)


print("------------------------")
acc = accuracy(nb_classifier, test_feats)
print(acc)
print("------------------------")
feats = nb_classifier.most_informative_features(n=25)
for feat in feats:
    print(feat) # some are NONE

print("------------------------")
nb_classifier.show_most_informative_features(n=25) # some are NONE

【Comments on the question】:

Tags: python nltk naivebayes

【Answer 1】:

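I think the full docstring of the NaiveBayesClassifier class explains it:

If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.

The feature value 'None' is reserved for unseen feature values; you generally should not use 'None' as a feature value for one of your own features.

If your data contains a feature that was never associated with a label, the value of that feature will be None. Suppose you train a classifier with features W and X, and then classify something with features W, X and Z. The value None will be used for feature Z, because that feature was never seen in training.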
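To see where a None value can come from on the training side, here is a simplified, standard-library-only sketch of the counting step. It is not NLTK's actual code; it assumes, in line with the comments in the NLTK source linked in the question, that a feature missing from a training document is tallied under the implicit value None for that document's label. The function name count_feature_values and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def count_feature_values(labeled_featuresets):
    """Tally feature values per (label, feature name).

    Assumption: a feature absent from a document counts as the value None
    for that document's label.
    """
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    feature_names = set()

    # Count the explicitly given values (always True in a bag of words).
    for featureset, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in featureset.items():
            value_counts[label, name][value] += 1
            feature_names.add(name)

    # Each document of a label that lacks a feature contributes one None.
    for label, n_docs in label_counts.items():
        for name in feature_names:
            missing = n_docs - sum(value_counts[label, name].values())
            if missing > 0:
                value_counts[label, name][None] += missing
    return value_counts


# Toy training data: bag-of-words featuresets with a label each.
train = [
    ({"gut": True, "film": True}, "pos"),
    ({"gut": True}, "pos"),
    ({"schlecht": True, "film": True}, "neg"),
]
counts = count_feature_values(train)
print(counts["pos", "schlecht"])  # "schlecht" never occurs in a "pos" document
print(counts["pos", "film"])      # "film" occurs in one of two "pos" documents
```

With bag-of-words features that are only ever True, a word that is rare for a label ends up mostly counted as None for that label, which would be one way for `word = None` to show up among the most informative features.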

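Further explanation:

I am not surprised to see None, because language data is sparse. In the movie review corpus, there will be some words that appear in only 1 or 2 documents. For example, an actor's name or a word from a title might appear in only 1 review.

It is common to remove both frequent (stop) words and infrequent words from the corpus before analysis. For their topic model of Science, Blei and Lafferty (2007) write: "The total vocabulary size in this collection is 375,144 terms. We trim the 356,195 terms that occurred fewer than 70 times as well as 296 stop words."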
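The pruning described above could be sketched as follows, using only the standard library. The helper names prune_vocabulary and bag_of_words_pruned, the min_doc_freq threshold, and the toy documents are assumptions for illustration; a sensible threshold depends on the size of the corpus.

```python
from collections import Counter


def prune_vocabulary(documents, min_doc_freq=2, stopwords=frozenset()):
    """Return the words kept after dropping stop words and rare words."""
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # count each word once per document
    return {
        word
        for word, freq in doc_freq.items()
        if freq >= min_doc_freq and word not in stopwords
    }


def bag_of_words_pruned(words, vocabulary):
    """Bag-of-words features restricted to the pruned vocabulary."""
    return {word: True for word in words if word in vocabulary}


# Toy corpus: each document is a list of tokens.
docs = [
    ["der", "film", "war", "gut"],
    ["der", "film", "war", "schlecht"],
    ["der", "schauspieler", "war", "gut"],
]
vocab = prune_vocabulary(docs, min_doc_freq=2, stopwords={"der", "war"})
print(sorted(vocab))                          # ['film', 'gut']
print(bag_of_words_pruned(docs[2], vocab))    # {'gut': True}
```

In the question's code, a function like bag_of_words_pruned could be passed as the feature_detector once the vocabulary has been computed over the training documents.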

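【Comments】:

Thanks for your reply! I still don't fully understand how this happens. Suppose my corpus contains the word "about". I use a bag-of-words model, so I assign this feature the value "True". The word "about" is in my corpus and not in my stop word list, and when I output the features I can see that every feature is assigned "True". Can you spot any mistake in my code? I use 75% of the corpus for training and 25% for testing. Does "None" mean that my test portion contains words that were never assigned a label?
I can't see any mistake in your code. Yes, my understanding is that None means the word was never encountered with a label during training. I have updated my answer with some additional explanation.
Hmm... that is still a bit strange. Apart from the stop word list, I don't do any pruning. Do you mean that NLTK's NB classifier does this kind of pruning automatically?
You get None because you have not done any pruning. If you prune the vocabulary, None values will occur less often; the NB classifier does not prune automatically. It is something you do before the analysis to reduce the dimensionality of the feature space.
Ah, now I get it. Remove the infrequent words, which would otherwise be assigned None, right? (Sorry, I am completely new to this.)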