Python nltk classify with large feature set (Replicate Go Et Al 2009): Answers

【Question title】: Python nltk classify with large feature set (Replicate Go Et Al 2009)
【Posted】: 2016-08-13 08:44:06
【Question】:

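I am trying to replicate the Twitter sentiment analysis from Go et al. (http://help.sentiment140.com/for-students). The problem I am having is that the number of features is 364,464. I am currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a copy of the 1,600,000 tweets and their polarity: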

for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))

Everything runs quickly except the extract_features function:

def extract_features(tweet, featureList):
  tweet_words = set(tweet)
  features = {}
  for word in featureList:
      features['contains(%s)' % word] = (word in tweet_words)
  return features

This is because, for every tweet, it creates a dictionary of size 364,464 that records whether each feature word is present or not.

Is there a way to make this faster or more efficient without reducing the number of features described in the paper?

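【Comments】:

I wonder why you don't want to use the same techniques as in the paper. In any case, basic NLP steps you could take include removing stop words, applying tf-idf vectorization, and dropping very rare or very common words... These also remove features, just in a different way. As I said, I'm not quite sure what you are trying to do.
As you guessed, I ran into memory problems, but I managed to solve them. Thanks for the reply.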

Tags: python twitter nltk sentiment-analysis

【Solution 1】:

It turns out there is a great function for this, nltk.classify.util.apply_features() (documented at http://www.nltk.org/api/nltk.classify.html):

    training_set = nltk.classify.apply_features(extract_features, tweets)

I had to change my extract_features function, but it now handles the full feature set without memory problems.

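Here is a brief excerpt from the function's description:

The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).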
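To illustrate what "constructed lazily" means here, the following is a minimal pure-Python sketch of the idea. It is not NLTK's implementation: the class name LazyFeatureSets, the toy_features extractor, and the calls counter are all hypothetical and exist only for this illustration. Each (features, label) pair is built at the moment it is read and is never stored.

```python
from collections.abc import Sequence


class LazyFeatureSets(Sequence):
    """Hypothetical stand-in for a lazily mapped list (not NLTK's own class)."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens
        self.calls = 0  # how many feature dicts have been built so far

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        token, label = self._labeled_tokens[index]
        self.calls += 1
        # The feature dict is created here, on access, and not cached.
        return (self._feature_func(token), label)


def toy_features(words):
    # Toy extractor used only for this illustration.
    return {word: True for word in words}


tweets = [(["good", "day"], "pos"), (["bad", "day"], "neg")]
lazy_sets = LazyFeatureSets(toy_features, tweets)
```

Creating lazy_sets builds no dictionaries at all; a dictionary only comes into existence when an element such as lazy_sets[0] is read, and it can be garbage-collected as soon as the caller is done with it. With 1,600,000 tweets this means only a handful of the 364,464-entry dictionaries need to be alive at any time.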

And here is my changed function:

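    def extract_features(tweet):
        # featureList is a module-level collection of all feature words.
        tweet_words = set(tweet)
        # Start with every feature marked as absent.
        features = {}
        for word in featureList:
            features[word] = False
        # Flip the features that actually occur in this tweet.
        for word in tweet_words:
            if word in featureList:
                features[word] = True
        return features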
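If featureList is a plain list, each `word in featureList` test scans up to 364,464 entries, so the function may still be slow even when memory is no longer a problem. A possible variant, sketched below, converts the vocabulary to a frozenset once and copies a pre-built all-False dictionary for every tweet. The helper name make_extractor and the three-word vocabulary are hypothetical, and the sketch assumes the feature words are hashable strings. It returns the same dictionary as the function above.

```python
def make_extractor(feature_words):
    """Build an extract_features function bound to a fixed vocabulary."""
    vocabulary = frozenset(feature_words)        # O(1) membership tests
    template = dict.fromkeys(vocabulary, False)  # built once, copied per tweet

    def extract_features(tweet):
        features = template.copy()
        # Only words that are both in the tweet and in the vocabulary are set.
        for word in vocabulary.intersection(tweet):
            features[word] = True
        return features

    return extract_features


# Tiny hypothetical vocabulary standing in for the 364,464 real features.
extract_features = make_extractor(["good", "bad", "day"])
features = extract_features(["good", "day", "unseen"])
```

The resulting one-argument function can be passed to nltk.classify.apply_features in the same way as the original. Every tweet still gets a full-size dictionary, so this only reduces the time per tweet, not the size of each feature set.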
