How to get the TF-IDF value of a word from a whole set of documents? Answers

【Question title】: How to get TF-IDF value of a word from all set of documents?
【Posted】: 2022-04-21 12:45:34
【Question description】:

I need the TF-IDF value of a word that is found in several documents, not just in a single or specific document.

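For example, consider this corpus:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]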

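I want to get the TF-IDF value for the word 'FIRST', which appears in documents 1 and 4. The TF-IDF value is calculated with respect to a specific document, so in this case I get two scores, one for each document. However, I need a single score for the word 'FIRST' that takes all documents into account at once.

Is there any way to get the TF-IDF score of a word from the whole set of documents? Is there another method or technique that could help me solve this problem?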

【Comments】:

Tags: python scikit-learn nlp tf-idf tfidfvectorizer

【Solution 1】:

tl;dr

Tf-Idf is not made to weight words. You cannot compute the Tf-Idf of a word. You can compute the frequency of a word in a corpus.

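What is TfIdf

Tf-Idf computes a score for a word with respect to a document! It gives high scores to words that are frequent (TF) in a document and specific (IDF) to it. The goal of TF-IDF is to compute similarity between documents, not to weight words.

The solution given by maaniB is essentially just the normalized frequency of the words. Depending on what you need to accomplish, you should find another metric to weight words (frequency is usually a good start).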
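As a starting point for the "frequency" suggestion above, here is a minimal sketch that counts how often each word occurs in the whole corpus and in how many documents it occurs. It uses only the standard library; the `tokenize` helper and the variable names are assumptions made for illustration, and the regex tokenizer is deliberately simple.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "Is this the second cow?, why is it blue?",
]


def tokenize(text):
    # Lowercase the text and keep runs of word characters (hypothetical helper).
    return re.findall(r"\w+", text.lower())


# Total number of occurrences of each word across the whole corpus.
term_counts = Counter(token for doc in corpus for token in tokenize(doc))

# Number of documents that contain each word at least once.
doc_counts = Counter(token for doc in corpus for token in set(tokenize(doc)))

# Corpus-level relative frequency: occurrences divided by the total token count.
total_tokens = sum(term_counts.values())
relative_freq = {word: n / total_tokens for word, n in term_counts.items()}

print(term_counts["first"], doc_counts["first"], relative_freq["first"])
```

This gives one number per word for the whole corpus, which is what the question asks for, without pretending that it is a TF-IDF score.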

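We can see that Tf-Idf gives a better score to 'cow' in doc 5, because 'cow' is specific to that document, but this is lost in maaniB's solution.

Example

As an example, we will compare the Tf-Idf of 'cow' and 'is'. The TF-IDF formula is (without logs): Tf * N / Df. N is the number of documents, Tf is the frequency of the word in the document, and Df is the number of documents in which the word appears.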
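For reference, the log-free formula can be written next to the common textbook variant with a logarithm. The symbols t (a word) and d (a document) are notation added here for illustration. Note that scikit-learn's TfidfVectorizer defaults (smooth_idf=True, norm='l2') use a smoothed logarithmic IDF and L2-normalise each document vector, so its numbers will not match a hand calculation with either line.

```latex
\mathrm{tfidf}_{\text{simple}}(t, d) = \mathrm{Tf}(t, d) \cdot \frac{N}{\mathrm{Df}(t)}
\qquad
\mathrm{tfidf}_{\log}(t, d) = \mathrm{Tf}(t, d) \cdot \log \frac{N}{\mathrm{Df}(t)}
```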

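'is' appears in every document, so its Df is 5. It appears once in documents 1, 2, 3 and 4, so its Tf there is 1, and twice in doc 5, so its Tf there is 2. The TF-IDF of 'is' in docs 1, 2, 3 and 4 is therefore 1 * 5 / 5 = 1; in doc 5 it is 2 * 5 / 5 = 2.

'cow' appears only in the 5th document, so its Df is 1. It appears once in document 5, so its Tf is 1. The TF-IDF of 'cow' in doc 5 is therefore 1 * 5 / 1 = 5; in every other document it is 0 * 5 / 1 = 0.

In summary, 'is' is very frequent in doc 5 (it appears twice) but not specific to doc 5 (it appears in every document), so its Tf-Idf is lower than that of 'cow', which appears only once but in only one document!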
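The hand calculation above can be reproduced with a few lines of plain Python. This is only a sketch of the log-free formula Tf * N / Df from this answer, not of what scikit-learn computes; `tokenize` and `simple_tfidf` are hypothetical helper names.

```python
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "Is this the second cow?, why is it blue?",
]


def tokenize(text):
    # Lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


docs = [tokenize(text) for text in corpus]
n_docs = len(docs)  # N in the formula


def simple_tfidf(word, doc_index):
    """Return Tf * N / Df for one word in one document (no logarithm)."""
    tf = docs[doc_index].count(word)            # occurrences in this document
    df = sum(1 for doc in docs if word in doc)  # documents containing the word
    return tf * n_docs / df if df else 0.0


for word in ("is", "cow"):
    print(word, [simple_tfidf(word, i) for i in range(n_docs)])
# is [1.0, 1.0, 1.0, 1.0, 2.0]
# cow [0.0, 0.0, 0.0, 0.0, 5.0]
```

The score is always a function of a word and a document together, which is why there is one value per document rather than one value per word.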

【Comments】:

【Solution 2】:

I think you can join your documents and recalculate the TF-IDF scores.

I assume your current implementation is:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]
df = pd.DataFrame({"texts": mylist})
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # ngram_range is documented as a tuple
tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tfidf = pd.DataFrame(
    tfidf_separate.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=df.index
)
df_tfidf
        and      blue       cow  document     first        is        it       one    second       the     third      this       why
0  0.000000  0.000000  0.000000  0.501885  0.604615  0.357096  0.000000  0.000000  0.000000  0.357096  0.000000  0.357096  0.000000
1  0.000000  0.000000  0.000000  0.757554  0.000000  0.269503  0.000000  0.000000  0.456308  0.269503  0.000000  0.269503  0.000000
2  0.521203  0.000000  0.000000  0.000000  0.000000  0.248356  0.000000  0.521203  0.000000  0.248356  0.521203  0.248356  0.000000
3  0.000000  0.000000  0.000000  0.501885  0.604615  0.357096  0.000000  0.000000  0.000000  0.357096  0.000000  0.357096  0.000000
4  0.000000  0.407798  0.407798  0.000000  0.000000  0.388636  0.407798  0.000000  0.329009  0.194318  0.000000  0.194318  0.407798

If you join your documents:

total = [' '.join(mylist)]
df2 = pd.DataFrame({"texts": total})
tfidf_total = tfidf_vectorizer.fit_transform(df2["texts"])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tfidf2 = pd.DataFrame(
    tfidf_total.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=df2.index
)
df_tfidf2
       and     blue      cow  document   first      is       it      one  second      the    third     this      why
0  0.09245  0.09245  0.09245    0.3698  0.1849  0.5547  0.09245  0.09245  0.1849  0.46225  0.09245  0.46225  0.09245
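It may help to see why the joined-document numbers are just normalized word counts. With a single document every term gets the same IDF, so after L2 normalisation only the raw counts matter. The sketch below reproduces the row above without scikit-learn; the regex is an approximation of the vectorizer's default tokenization (lowercase, tokens of two or more word characters), and the variable names are assumptions for illustration.

```python
import math
import re
from collections import Counter

mylist = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "Is this the second cow?, why is it blue?",
]
joined = " ".join(mylist)

# Approximation of the default tokenizer: lowercase, tokens of 2+ word characters.
tokens = re.findall(r"\b\w\w+\b", joined.lower())
counts = Counter(tokens)

# L2 norm of the count vector; dividing by it gives a unit-length vector.
norm = math.sqrt(sum(n * n for n in counts.values()))
normalized = {word: n / norm for word, n in sorted(counts.items())}

for word, value in normalized.items():
    print(f"{word:10s}{value:.5f}")
```

For example, 'is' occurs 6 times and the norm is sqrt(117), which gives roughly 0.5547, the same value as in the table above.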

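【Comments】:

【Solution 3】:

Thanks to @maaniB for the answer.

@Milan - you can try the approach/code below to get the TF-IDF value for a single document.

A simpler way is to get the feature names, take the column-wise sum of the TF-IDF array, and create a DataFrame from the result.

The code is as follows: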

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]


df = pd.DataFrame({"texts": mylist})
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # ngram_range is documented as a tuple
tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])


word_lst = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
count_lst = tfidf_separate.toarray().sum(axis=0)

vocab_df = pd.DataFrame((zip(word_lst,count_lst)),
                          columns= ["vocab","tfidf_value"])

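# sort_values returns a new DataFrame; the result is not assigned here,
# so the print below still shows the rows in vocabulary (alphabetical) order
vocab_df.sort_values(by="tfidf_value",ascending=False)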
print(vocab_df)

       vocab  tfidf_value
0        and     0.521203
1       blue     0.407798
2        cow     0.407798
3   document     1.761324
4      first     1.209230
5         is     1.620686
6         it     0.407798
7        one     0.521203
8     second     0.785317
9        the     1.426368
10     third     0.521203
11      this     1.426368
12       why     0.407798
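Summing is only one way to collapse the per-document scores into a single number; the mean or the maximum are equally easy to compute, and which one is meaningful (if any) depends on the task. The sketch below looks up the column for one word through the vectorizer's `vocabulary_` mapping and reports a few aggregates. The `word_scores` helper and the `summary` dictionary are hypothetical names, and the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "Is this the second cow?, why is it blue?",
]

vectorizer = TfidfVectorizer()
# Dense array of shape (n_documents, n_terms); fine for a toy corpus.
tfidf = vectorizer.fit_transform(mylist).toarray()


def word_scores(word):
    """Per-document TF-IDF scores for one word (0.0 where the word is absent)."""
    column = vectorizer.vocabulary_[word]  # maps a term to its column index
    return tfidf[:, column]


first = word_scores("first")
summary = {
    "per_document": first.round(6).tolist(),
    "sum": float(first.sum()),
    "mean": float(first.mean()),
    "max": float(first.max()),
}
print(summary)
```

For 'first' the sum should agree with the 1.209230 in the table above, since only documents 1 and 4 contribute a non-zero score.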

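Hope this helps!!

【Comments】:

Please add the code and data as text (using code formatting), not as images. Images: A) don't allow us to copy and paste the code/errors/data for testing; B) don't permit searching based on the code/error/data contents; and many more reasons. Images should only be used, in addition to text in code format, if the image adds something significant that is not conveyed by the text of the code/error/data alone.
Thanks, I have converted the code to text format.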