TfidfVectorizer: remove features with zero tf-idf score (answers)

【Question title】: TfidfVectorizer remove features with zero tf-idf score
【Posted】: 2016-09-07 15:58:56
【Question】:

I want to cluster documents using Python. First, I generate the document x term matrix with tf-idf scores as follows:

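from sklearn.feature_extraction.text import TfidfVectorizer
# tokenize_and_stem and descriptions are defined elsewhere (not shown in the question)
tfidf_vectorizer_desc = TfidfVectorizer(min_df=1, max_df=0.9, use_idf=True, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions)  # fit the vectorizer to the text
desc_feature_names = tfidf_vectorizer_desc.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0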

The matrix has shape (1510, 6862).

Score of each term for the first document:

dense = tfidf_matrix_desc.todense()
print(len(dense[0].tolist()[0]))
dataset0 = dense[0].tolist()[0] 
phrase_scores = [pair for pair in zip(range(0, len(dataset0)), dataset0) if pair[1] > 0]
print(len(phrase_scores))

Output:

print(len(dense[0].tolist()[0])) -> 6862
print(len(phrase_scores)) -> 48  *only 48 terms in the first document have a score greater than 0.0.

Now I want to identify, from the matrix, all features (terms) whose tf-idf score is 0 across the given dataset. How can I do that?

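# nonzero()[1] holds the column indices of the nonzero entries; restrict it to the first document (row 0)
for col in tfidf_matrix_desc[0].nonzero()[1]:
    print(desc_feature_names[col], ' - ', tfidf_matrix_desc[0, col])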

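【Comments】:

If min_df=0.1 and max_df=0.9 and you did not specify a vocabulary in advance, how could a feature always be equal to zero?
I have edited the question and included an example in which terms have a tf-idf score of zero for the first document.
See the documentation for TfidfVectorizer.get_feature_names.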
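The point raised in the first comment can be checked on a toy corpus. The following is a minimal sketch, not the asker's setup: the three sentences in `docs` are made up for illustration and the default tokenizer is used instead of `tokenize_and_stem`. When the vocabulary is learned from the corpus itself, every term occurs in at least one document, so no column should be zero everywhere even though each individual row is mostly zeros.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# Per term (column): number of documents with a nonzero score.
docs_per_term = np.asarray((X != 0).sum(axis=0)).ravel()

# Columns that are zero in every document; expected to be empty here.
all_zero_columns = np.flatnonzero(docs_per_term == 0)

# Per document (row): number of terms with a nonzero score.
terms_per_doc = np.asarray((X != 0).sum(axis=1)).ravel()

print(X.shape, all_zero_columns, terms_per_doc)
```

All-zero columns would only be expected if a fixed `vocabulary` were passed in, or if small scores are zeroed out afterwards with a threshold.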

Tags: python tf-idf

【Solution 1】:

In case anyone needs something similar, this is what I am using:

import numpy as np

# Xtr is the output sparse matrix from TfidfVectorizer
# min_tfidf is a threshold for defining the "new" 0
def remove_zero_tf_idf(Xtr, min_tfidf=0.04):
    D = Xtr.toarray()  # convert to dense (can be memory-hungry for large matrices)
    D[D < min_tfidf] = 0  # treat scores below the threshold as zero
    tfidf_means = np.mean(D, axis=0)  # a mean of 0 means the feature is 0 in all documents
    D = np.delete(D, np.where(tfidf_means == 0)[0], axis=1)  # delete those columns from the matrix
    return D
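For larger matrices, the same idea can be applied without converting to dense. The following is a rough sketch under assumptions: `drop_all_zero_features` is a hypothetical helper name, not part of scikit-learn or SciPy, and it additionally returns the indices of the kept columns so the matching feature names can be looked up.

```python
import numpy as np
from scipy import sparse

def drop_all_zero_features(X, min_tfidf=0.0):
    """Drop columns whose score is below min_tfidf in every document (sketch)."""
    X = sparse.csr_matrix(X, copy=True)  # copy so the caller's matrix is untouched
    X.data[X.data < min_tfidf] = 0       # treat small scores as zero
    X.eliminate_zeros()                  # remove the stored zeros from the sparse structure
    docs_per_term = np.asarray((X != 0).sum(axis=0)).ravel()
    keep = np.flatnonzero(docs_per_term > 0)  # columns with at least one nonzero score
    return X[:, keep], keep

# Tiny made-up matrix: column 1 is all zero, column 2 is below the threshold everywhere.
X = sparse.csr_matrix(np.array([
    [0.50, 0.00, 0.01],
    [0.30, 0.00, 0.02],
]))
X_kept, kept_cols = drop_all_zero_features(X, min_tfidf=0.04)
print(X_kept.toarray(), kept_cols)
```

The returned `kept_cols` can be used to index the list of feature names from the vectorizer, so the reduced matrix and its column labels stay aligned.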

【Comments】: