【Question title】: Feed pre-computed estimates to TfidfVectorizer
【Posted】: 2016-03-11 06:01:01
【Description】:

I trained an instance of scikit-learn's TfidfVectorizer and I want to save it to disk. I saved the IDF weights (the idf_ attribute) to disk as a numpy array and the vocabulary (vocabulary_) as a JSON object (I'm avoiding pickle for security and other reasons). This is what I'm trying:

import json
from idf import idf # numpy array with the pre-computed IDFs
from sklearn.feature_extraction.text import TfidfVectorizer

# dirty trick so I can plug my pre-computed IDFs
# necessary because "vectorizer.idf_ = idf" doesn't work,
# it returns "AttributeError: can't set attribute."
class MyVectorizer(TfidfVectorizer):
    TfidfVectorizer.idf_ = idf

# instantiate vectorizer
vectorizer = MyVectorizer(lowercase = False,
                          min_df = 2,
                          norm = 'l2',
                          smooth_idf = True)

# plug vocabulary
vocabulary = json.load(open('vocabulary.json', mode = 'rb'))
vectorizer.vocabulary_ = vocabulary

# test it
vectorizer.transform(['foo bar'])
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1314, in transform
    return self._tfidf.transform(X, copy=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1014, in transform
    check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: idf vector is not fitted

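So, what am I doing wrong? I'm failing to fool the vectorizer object: somehow it knows that I'm cheating (i.e., passing it pre-computed data instead of training it on actual text). I inspected the attributes of the vectorizer object, but I couldn't find anything like "trained", "fitted", etc. So, how do I fool the vectorizer?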
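
For reference, the pickle-free persistence step described at the top of the question could be sketched roughly as follows. This is only an illustration under stated assumptions: the file names idf.json and vocabulary.json, the helper names save_state and load_state, and the choice to store the IDF weights as JSON rather than as a numpy file are all hypothetical, and the two small literals stand in for vectorizer.vocabulary_ and vectorizer.idf_.tolist().

```python
import json
import os
import tempfile

# Hypothetical stand-ins for vectorizer.vocabulary_ and vectorizer.idf_.tolist().
vocabulary = {"bar": 0, "foo": 1}
idf = [1.0, 1.4054651081081644]


def save_state(directory, vocabulary, idf):
    """Write the vocabulary and the IDF weights as two JSON files."""
    # int() / float() convert NumPy scalar types, which json cannot encode.
    clean_vocabulary = {term: int(index) for term, index in vocabulary.items()}
    clean_idf = [float(weight) for weight in idf]
    with open(os.path.join(directory, "vocabulary.json"), "w") as handle:
        json.dump(clean_vocabulary, handle)
    with open(os.path.join(directory, "idf.json"), "w") as handle:
        json.dump(clean_idf, handle)


def load_state(directory):
    """Read back the vocabulary dict and the list of IDF weights."""
    with open(os.path.join(directory, "vocabulary.json")) as handle:
        loaded_vocabulary = json.load(handle)
    with open(os.path.join(directory, "idf.json")) as handle:
        loaded_idf = json.load(handle)
    return loaded_vocabulary, loaded_idf


state_dir = tempfile.mkdtemp()
save_state(state_dir, vocabulary, idf)
loaded_vocabulary, loaded_idf = load_state(state_dir)
```

The position of each weight in the IDF list has to match the column index stored in the vocabulary, so the two files are only meaningful as a pair.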

【Comments】:

    Tags: python-2.7 scikit-learn


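    【Solution 1】:

    OK, I think I got it: the vectorizer instance has an attribute _tfidf, which in turn must have an attribute _idf_diag. The transform method calls the check_is_fitted function to check whether _idf_diag exists. (I had missed it because it is an attribute of an attribute.) So I checked the TfidfVectorizer source code to see how _idf_diag is created. Then I just added it to the _tfidf attribute: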

    import scipy.sparse as sp
    
    # ... code ...
    
    vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                             diags = 0,
                                             m = len(idf),
                                             n = len(idf))
    

    Now the vectorization works.
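
    To see why a diagonal matrix is what gets stored, here is a minimal pure-Python sketch of what multiplying a row of term counts by that matrix amounts to, followed by the L2 normalisation requested with norm = 'l2'. It is an illustration of the idea, not scikit-learn's actual code: the function name tfidf_row, the toy vocabulary, the toy IDF values, and the whitespace tokenisation are all assumptions made for the example.

```python
import math


def tfidf_row(counts, idf, norm="l2"):
    """Scale one row of term counts by the IDF weights, then normalise it."""
    if len(counts) != len(idf):
        raise ValueError("counts and idf must have the same length")
    # A row vector times diag(idf) simply scales entry j by idf[j].
    weighted = [count * weight for count, weight in zip(counts, idf)]
    if norm == "l2":
        length = math.sqrt(sum(value * value for value in weighted))
        if length > 0:
            weighted = [value / length for value in weighted]
    return weighted


# Toy stand-ins for the loaded vocabulary and pre-computed IDF weights.
vocabulary = {"bar": 0, "foo": 1}
idf = [1.0, 2.0]

# Count the known terms of one document (naive whitespace tokenisation).
counts = [0] * len(vocabulary)
for token in "foo bar foo".split():
    if token in vocabulary:
        counts[vocabulary[token]] += 1

row = tfidf_row(counts, idf)
```

    Here counts is [1, 2], so the weighted row is [1.0, 4.0] before normalisation. This also shows why len(idf) must equal the vocabulary size: each IDF weight lines up with one vocabulary column.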
