How to add features to scikit-learn DictVectorizer? Answers

【Question title】: How to add features to scikit-learn DictVectorizer?
【Posted】: 2015-06-15 13:02:45
【Question description】:

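I am training a spam detector with the MultinomialNB model from scikit-learn. I use the DictVectorizer class to convert tokens into word counts (i.e. features). I would like to be able to train the model over time with new data (in this case, chat messages coming into our app server). For this, the partial_fit function looks useful.

However, I cannot figure out how to grow the DictVectorizer after it has initially been "trained". If new features/words show up that it has never seen, they are simply ignored. What I would like to do is pickle the current version of the model and the DictVectorizer and update them each time we do a new training session. Is this possible?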
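
The behaviour described above can be reproduced with a minimal sketch. The feature names and values here are made up for illustration; the point is only that `transform` on a fitted DictVectorizer drops any key that was not present during `fit`.

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
vec.fit([{'foo': 1, 'bar': 2}])          # vocabulary is fixed here

# 'baz' was never seen during fit, so transform() silently ignores it.
X_new = vec.transform([{'foo': 3, 'baz': 1}])

print(vec.feature_names_)  # ['bar', 'foo']
print(X_new)               # [[0. 3.]]
```

The output has only two columns, one per feature learned at fit time, so the count for `baz` is lost.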

【Question comments】:

Tags: python machine-learning scikit-learn spam-prevention naivebayes

【Solution 1】:

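In the documentation, a list of dictionaries is used to fit the DictVectorizer. You can probably append the new samples to that original list and call fit_transform again; that way the new features are added to the DictVectorizer.

Be careful with the partial_fit method: it is fairly heavy. As its documentation notes, each call carries some processing overhead, so it is best called on chunks of data that are as large as memory allows.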

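from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)  # learns the feature names: bar, baz, foo

# ... train the model on X ...

# When new data arrives (each sample is a dictionary of feature counts)
values = {'foo': 2, 'qux': 4}  # hypothetical new sample containing an unseen feature
D.append(values)
X = v.fit_transform(D)  # fit again so the new feature gets its own column

# Two choices:
# - wait until several new samples have accumulated before refitting, or
# - refit every time a sample arrives (simple, but slow)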
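
The first of the two choices in the comments above (accumulate changes, then refit) could look roughly like the sketch below. `BatchedRefitter` and its `refit_every` threshold are hypothetical names introduced here for illustration, not part of scikit-learn. Note that the classifier is rebuilt together with the vectorizer: once the vocabulary grows, the column layout changes, so a model trained on the old matrix no longer lines up with the new one.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB


class BatchedRefitter:
    """Hypothetical helper: refit only after `refit_every` new samples."""

    def __init__(self, samples, labels, refit_every=2):
        self.samples = list(samples)
        self.labels = list(labels)
        self.refit_every = refit_every
        self.pending = 0
        self._refit()

    def _refit(self):
        # Column indices change whenever the vocabulary grows, so the
        # classifier is rebuilt together with the vectorizer.
        self.vec = DictVectorizer()
        X = self.vec.fit_transform(self.samples)
        self.clf = MultinomialNB().fit(X, self.labels)
        self.pending = 0

    def add(self, sample, label):
        self.samples.append(sample)
        self.labels.append(label)
        self.pending += 1
        if self.pending >= self.refit_every:
            self._refit()


model = BatchedRefitter([{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}], [0, 1])
model.add({'qux': 4}, 1)                 # buffered, no refit yet
n_before = len(model.vec.feature_names_)
model.add({'foo': 2, 'quux': 1}, 0)      # second new sample triggers a refit
n_after = len(model.vec.feature_names_)
print(n_before, n_after)  # 3 5
```

This still keeps the full history in memory; it only reduces how often the refit cost is paid.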

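【Comments】:

If I have to keep the entire list of dictionaries around the whole time, this is no use when streaming large amounts of data. Basically, every time new data arrives I would be retraining on the entire history.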
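
One way around the concern raised in this comment, which is not what the answer above proposes, is to replace DictVectorizer with scikit-learn's `FeatureHasher`. It maps feature names to a fixed number of columns by hashing, so it has no vocabulary to grow and no history to keep. The sketch below is a minimal illustration under stated assumptions: the labels (0 = ham, 1 = spam), the sample messages, and the `update` helper are invented for the example, and `n_features=2**18` is an arbitrary size.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps hashed counts non-negative, which MultinomialNB needs.
hasher = FeatureHasher(n_features=2**18, input_type='dict', alternate_sign=False)
clf = MultinomialNB()
CLASSES = [0, 1]  # assumed labels: 0 = ham, 1 = spam


def update(batch, labels):
    """Hypothetical helper: train incrementally on one batch of token-count dicts."""
    X = hasher.transform(batch)  # stateless: nothing is learned or stored here
    clf.partial_fit(X, labels, classes=CLASSES)


update([{'hello': 1, 'friend': 1}, {'free': 2, 'money': 1}], [0, 1])
update([{'free': 1, 'bitcoin': 3}], [1])  # 'bitcoin' is new, yet still gets a column

pred = clf.predict(hasher.transform([{'free': 1, 'bitcoin': 1}]))
```

Because the hasher is stateless, only the classifier needs to be pickled between training sessions. The trade-offs are that hashed columns cannot be mapped back to feature names and that distinct words can occasionally collide in the same column.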