TF IDF Vectorizer not functioning properly: answers

【Question title】: TF IDF Vectorizer not functioning properly
【Posted】: 2020-07-21 07:35:01
【Question description】:

I am working on a text classification problem and using a TF-IDF vectorizer to generate the text features.

Here is the code:

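from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    # stop_words=English_Stopwords,
    ngram_range=(1, 3),
    min_df=0.10,  # ignore terms that have a document frequency strictly lower than the given threshold
    max_df=0.80,  # ignore terms that have a document frequency strictly higher than the given threshold
    smooth_idf=True,
)
# Learn the vocabulary and IDF weights, then vectorize the training and validation splits.
fitted_vect = tfidf_vectorizer.fit(df_sample[TEXT_FEAT])
transformed_X_train = tfidf_vectorizer.transform(X_train)
transformed_X_val = tfidf_vectorizer.transform(X_val)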

I checked the vocabulary and it contains only 162 terms, while the stop-word list is huge. What is wrong here?

print(len(fitted_vect.vocabulary_))
# 162
print(len(fitted_vect.stop_words_))
# 16969712
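
One possible explanation for these two numbers, sketched on a toy corpus. As far as the scikit-learn documentation describes it, stop_words_ is not the stop-word list passed to the vectorizer but the set of terms dropped by min_df, max_df, or max_features. With ngram_range=(1,3), most 2-grams and 3-grams occur in fewer than 10% of documents, so min_df=0.10 discards them. The corpus below is invented for illustration, and the dropped terms are computed as a set difference between two vocabularies rather than read from stop_words_.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 20 documents, each with two words shared by all
# documents, one word shared by a quarter of them, and one unique word.
docs = [f"shared text color{i % 4} unique{i}" for i in range(20)]

# No document-frequency pruning: every 1-gram, 2-gram, and 3-gram is kept.
unpruned = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

# Same thresholds as in the question: keep only n-grams that occur in
# at least 10% and at most 80% of the documents.
pruned = TfidfVectorizer(ngram_range=(1, 3), min_df=0.10, max_df=0.80).fit(docs)

# Terms removed purely by the min_df / max_df thresholds.
dropped = set(unpruned.vocabulary_) - set(pruned.vocabulary_)

print("n-grams before pruning:", len(unpruned.vocabulary_))
print("n-grams after pruning: ", len(pruned.vocabulary_))
print("n-grams dropped:       ", len(dropped))
```

If this is what is happening on the real data, lowering min_df (or passing an absolute count such as min_df=5) or narrowing ngram_range should enlarge the vocabulary.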

【Question comments】:

Tags: python-3.x nlp tfidfvectorizer

【Solution 1】:

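If the only problem is stop-word suppression, just add the parameter stop_words='english' to the vectorizer.
Note: the question contains a custom stop-word list that has been commented out. If needed, those stop words can be appended to the built-in English list.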

tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    ngram_range=(1, 3),
    stop_words='english',
    min_df=0.10,  # ignore terms that have a document frequency strictly lower than the given threshold
    max_df=0.80,
    smooth_idf=True,
)
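
To append a custom list to the built-in English stop words, something along these lines should work. This is a minimal sketch: custom_stop_words is a made-up stand-in for the English_Stopwords variable commented out in the question, and the sample sentence is invented. ENGLISH_STOP_WORDS is the frozenset that scikit-learn exposes in sklearn.feature_extraction.text.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical stand-in for the question's commented-out English_Stopwords.
custom_stop_words = ["movie", "film"]

# Merge the built-in English list with the custom terms and pass a plain list.
combined_stop_words = sorted(ENGLISH_STOP_WORDS.union(custom_stop_words))

tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    ngram_range=(1, 3),
    stop_words=combined_stop_words,
    min_df=0.10,
    max_df=0.80,
    smooth_idf=True,
)

# build_analyzer() returns the preprocessing + tokenizing + n-gram function,
# so the effect of the stop words can be inspected without fitting.
analyzer = tfidf_vectorizer.build_analyzer()
tokens = analyzer("The movie was about dancing robots")
print(tokens)
```

Both the built-in words ("the", "was", "about") and the custom one ("movie") are removed before the n-grams are built, so only n-grams made of the remaining words are produced.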

【Comments】: