【Question title】: Error in scikit-learn TfidfVectorizer when using the parameter stop_words
【Posted】: 2013-12-05 12:46:54
【Question】:

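I am new to scikit-learn. I am trying to fit a TF-IDF vectorizer on a 1*M numpy.array, tot_data (in the code below), made up of English sentences. Here `word` is a numpy.array (1*173) containing the list of stop words. I need to define the stop_words parameter explicitly. The code runs fine if I do not pass stop_words, but the line below raises an error.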

word = numpy.array(['a','about',...])

>>> vectorizer = TfidfVectorizer(max_df=.95,stop_words=word).fit(tot_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1203, in fit
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 710, in _count_vocab
    analyze = self.build_analyzer()
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 225, in build_analyzer
    stop_words = self.get_stop_words()
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 208, in get_stop_words
    return _check_stop_list(self.stop_words)
  File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 85, in _check_stop_list
    if stop == "english":
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

【Comments】:

    Tags: python numpy scikit-learn


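    【Solution 1】:

    Cause: a numpy array propagates the comparison to its elements, so `stop == "english"` returns an array rather than a single boolean: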

    >>> word == 'english'
    array([False, False, False], dtype=bool)
    

    The if statement cannot convert the resulting array to a single boolean:

    >>> if word == 'english': pass
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
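The same behavior can be reproduced without scikit-learn. The snippet below is only a minimal sketch using the three-word array from the demo; the variable names `mask`, `raised`, and `as_list` are illustrative and do not come from the library source.

```python
import numpy as np

word = np.array(['one', 'two', 'three'])

# Comparing an array with a string yields one boolean per element.
mask = word == 'english'
print(mask)  # [False False False]

# `if mask:` calls bool(mask), which raises for a multi-element array.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# A plain list is compared as a whole object, so the result is a single bool.
as_list = word.tolist()
print(as_list == 'english')  # False
```

This is why a plain list passes through a check of the form `stop == "english"` while an array does not.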
    

    Solution: convert the stop words to a plain list: word = list(word)

    Demo:

    >>> import numpy as np
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> word = np.array(['one','two','three'])
    >>> tot_data = np.array(['one two three', 'who do I see', 'I see two girls'])
    >>> v = TfidfVectorizer(max_df=.95,stop_words=list(word))
    >>> v.fit(tot_data)
    TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
       ...
            tokenizer=None, use_idf=True, vocabulary=None)
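To confirm that the stop words were actually applied, the fitted vocabulary can be inspected. The following is a hedged sketch that reuses the demo data and assumes a scikit-learn version that exposes the fitted `vocabulary_` attribute; `word.tolist()` is used here as an alternative to `list(word)` that also converts the elements to plain Python strings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

word = np.array(['one', 'two', 'three'])
tot_data = np.array(['one two three', 'who do I see', 'I see two girls'])

# Pass the stop words as a plain Python list rather than a numpy array.
v = TfidfVectorizer(max_df=.95, stop_words=word.tolist())
v.fit(tot_data)

# vocabulary_ maps each retained term to its column index.
vocab = sorted(v.vocabulary_)
print(vocab)

# None of the stop words should appear among the learned terms.
leaked = set(word.tolist()) & set(vocab)
print(leaked)
```

If `leaked` is empty, the stop words were filtered out before the vocabulary was built.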
    

    【Discussion】:

    • @amitbisai You're welcome! Standard disclaimer: if you find the answer useful, please consider accepting it.