Transform a tf-idf pandas DataFrame into a tf-idf matrix: answers

【Question title】: Transform a tf-idf pandas DataFrame into a tf-idf matrix
【Posted】: 2016-02-21 18:40:10
【Question description】:

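How can I convert the following pandas DataFrame, which holds the tf-idf score of each word across several documents, into a matrix named "tfidf", so that I can do, for example: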

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

str = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([str])

【Question comments】:

Tags: python pandas tf-idf

【Solution 1】:

You need to fit the TfidfVectorizer on the original raw documents before you can use it to transform new documents.
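A minimal sketch of that fit-then-transform workflow, assuming a small made-up corpus `docs` stands in for the original documents (the corpus and the name `new_sentence` are illustrative and not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the original raw documents.
docs = [
    "the king spoke to the lord",
    "juliet spoke to the king",
    "the computer has unseen text",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)  # learns the vocabulary and the idf weights from the corpus

new_sentence = "this sentence has unseen text such as computer but also king lord juliet"
# Words that were not in the fitted vocabulary are ignored by transform().
response = tfidf.transform([new_sentence])

print(response.shape)  # (1, number of words in the fitted vocabulary)
```

The result is a sparse matrix with one row per input document and one column per vocabulary word.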

If you do not have access to the original documents, you can still recover the idf weight of each word by building a dictionary:

idfs[word] = log{(# documents) / (# documents where word has non-zero tf-idf weight)}
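The formula above could be applied to a tf-idf DataFrame roughly as follows. This is only a sketch: the table `tfidf_df` is invented, and it assumes one row per document and one column per word, which may not match the layout of the DataFrame in the question.

```python
import math
import pandas as pd

# Hypothetical tf-idf table: one row per document, one column per word.
tfidf_df = pd.DataFrame(
    {
        "king": [0.5, 0.3, 0.0],
        "lord": [0.7, 0.0, 0.0],
        "computer": [0.0, 0.0, 0.9],
    },
    index=["doc1", "doc2", "doc3"],
)

n_docs = len(tfidf_df)
# Number of documents in which each word has a non-zero tf-idf weight.
doc_freq = (tfidf_df != 0).sum(axis=0)

# idf = log(n_docs / document frequency); words that never occur are skipped.
idfs = {word: math.log(n_docs / df) for word, df in doc_freq.items() if df > 0}
```

Note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2 normalization by default, so weights recovered this way will generally not equal the ones it would produce.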

You can then use that dictionary to compute the tf-idf weights of a new sentence:

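from collections import Counter

# `sentence` is the new text; `idfs` is the dictionary built above.
words = sentence.split()
s_tfs = Counter(words)  # raw term frequencies in the sentence
# Words missing from `idfs` get idf 0, so they contribute nothing.
s_idfs = {word: idfs.get(word, 0) for word in words}
# One entry per known word; words absent from the sentence get weight 0.
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs.keys()}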
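To get from such a dictionary to a matrix row, the weights need a fixed column order. A small sketch, assuming an invented `idfs` dictionary and a hypothetical helper `tfidf_row` (neither is from the original answer):

```python
from collections import Counter

# Hypothetical idf dictionary, e.g. recovered from the training data.
idfs = {"king": 0.405, "lord": 1.099, "computer": 1.099}
vocabulary = sorted(idfs)  # fixed column order: computer, king, lord


def tfidf_row(sentence):
    """Return the sentence's tf-idf weights as a list aligned with `vocabulary`."""
    tfs = Counter(sentence.split())
    # Words outside the vocabulary (such as "unseen") are simply dropped.
    return [tfs.get(word, 0) * idfs[word] for word in vocabulary]


row = tfidf_row("king lord king unseen")
```

Stacking one such row per sentence gives a matrix whose columns line up with the vocabulary.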

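【Comments】:

Thanks. By recovering the idf weight of each word, you mean the idf weights of the new words in the test set, right? How does knowing the tf-idf weights of the test words help me classify them into documents?
Yes, you need the tf-idf weights of the words in the test set. As for your second question, I think you should read some basic tutorials on text classification. Basically, you train a classifier on the original tf-idf weights of the training set, and then use it to classify a test document from that document's tf-idf weights.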