【Question title】: "Wrong" TF IDF Scores
【Posted】: 2020-12-23 13:41:51
【Question description】:

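I have 1000 .txt files and plan to search them for various keywords and compute their TF-IDF scores. For some reason, though, the results are > 1. I then tested with 2 .txt files: "I am studying nfc" and "You don't need AI". For nfc and AI the TF-IDF should be 0.25, but when I open the .csv it shows 1.4054651081081644.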

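I have to admit I did not pick the most efficient way to write the code. I think the error lies in the folder handling, because I originally planned to examine the files by year (annual reports for 2000-2010). I dropped that plan and decided to examine all the annual reports as a whole, but I think the folder workaround is still the problem. I put the 2 .txt files into the folder "-". Is there a way to make it compute correctly?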

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['AI', 'nfc']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.80, min_df=1, stop_words='english', use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 3))
vectorizer_cnt = CountVectorizer(stop_words='english', vocabulary=words_to_find, ngram_range=(1, 3))
#
years = ['-']
year_folders = [root + folder for folder in years]
# remove previous results file
if os.path.isfile('summary.csv'):
    os.remove('summary.csv')
if os.path.isfile('tf_idf.csv'):
    os.remove('tf_idf.csv')
#process every folder (for every year)
for year_idx, year_folder in enumerate(year_folders):
    # get file paths in folder
    file_paths = []
    for file in Path(year_folder).rglob("*.txt"):
        file_paths.append(file)
    # count of files for each year
    file_cnt = len(file_paths)
    # read every file's text as string
    docs_per_year = []
    words_in_folder = 0
    for txt_file in file_paths:
        with open(txt_file, encoding='utf-8', errors="replace") as f:
            txt_file_as_string = f.read()
            words_in_folder += len(txt_file_as_string.split())
            docs_per_year.append(txt_file_as_string)
    #
    tf_idf_documents_as_array = vectorizer_tf_idf.fit_transform(docs_per_year).toarray()
    # tf_idf_documents_as_array = vectorizer_tf_idf.fit_transform([' '.join(docs_per_year)]).toarray()
    #
    cnt_documents_as_array = vectorizer_cnt.fit_transform(docs_per_year).toarray()
    #
    with open('summary.csv', 'a') as f:
        f.write('Index;Term;Df;Count;Idf;Rel. Frequency\n')
        for idx, word in enumerate(words_to_find):
            abs_freq = cnt_documents_as_array[:, idx].sum()
            f.write('{};{};{};{};{};{}\n'.format(idx + 1,
                                                    word,
                                                    np.count_nonzero(cnt_documents_as_array[:, idx]),
                                                    abs_freq,
                                                    vectorizer_tf_idf.idf_[idx],
                                                    abs_freq / words_in_folder))
        f.write('\n')

    with open('tf_idf.csv', 'a') as f:
        if not wrote_tf_idf_header:
            f.write('{}\n'.format(years[year_idx]))
            f.write('Index;Year;File;')
            for word in words_to_find:
                f.write('{};'.format(word))
            f.write('Sum\n')
            wrote_tf_idf_header = True

        for idx, tf_idfs in enumerate(tf_idf_documents_as_array):
            f.write('{};{};{};'.format(tf_idf_file_idx, years[year_idx], file_paths[idx].name))
            for word_idx, _ in enumerate(words_to_find):
                f.write('{};'.format(tf_idf_documents_as_array[idx][word_idx]))
            f.write('{}\n'.format(sum(tf_idf_documents_as_array[idx])))

            tf_idf_file_idx += 1

print()

【Question discussion】:

    Tags: numpy scikit-learn tf-idf tfidfvectorizer countvectorizer


    【Solution 1】:

    I think the mistake is that you set norm=None; the norm should be 'l1' or 'l2', as specified in the documentation.
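As a rough illustration of where 1.4054651081081644 comes from, here is a minimal sketch that compares norm=None with norm='l2'. It assumes the two test sentences from the question, scikit-learn's default smooth_idf=True, and a lowercased vocabulary (an assumption made here because TfidfVectorizer lowercases its input by default). The other options from the question's code are left out for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# The two test documents from the question (wording assumed).
docs = ["I am studying nfc", "You don't need AI"]
# Lowercased vocabulary, since the vectorizer lowercases the text by default.
vocab = ["ai", "nfc"]

# norm=None: each entry is the raw product count * idf.
raw = TfidfVectorizer(vocabulary=vocab, norm=None).fit_transform(docs).toarray()

# norm='l2': each document row is rescaled to unit Euclidean length.
l2 = TfidfVectorizer(vocabulary=vocab, norm="l2").fit_transform(docs).toarray()

# With smooth_idf=True, idf = ln((1 + n_docs) / (1 + df)) + 1.
n_docs, df = 2, 1
smoothed_idf = np.log((1 + n_docs) / (1 + df)) + 1

print(raw)           # unnormalized scores, each nonzero entry equals smoothed_idf
print(l2)            # normalized scores, never larger than 1
print(smoothed_idf)  # ln(3/2) + 1
```

Without a norm the scores are not bounded by 1, because each term appears once and the smoothed idf itself is ln(3/2) + 1. With 'l2', a document that contains a single vocabulary term gets exactly 1.0 for that term, so the expected 0.25 would not appear in either case.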

    【Discussion】:

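    • Thanks. It did change the values. I have now tried it on my document sample. In one document (672 words), "AI" appears once, but it gives me a value of 0.94. Across the whole document set (1092) it appears 55 times. For some reason I still think that is not right. So it should be 1/672 (TF) * log2(1092/55) = 0.006415769345238, right?
    • The computed values differ depending on the norm. Which norm did you choose, and could you also provide a minimal reproducible example of your code (that is, including some data with which you get this value)? At the moment your code cannot be run because the data is missing. Reference: stackoverflow.com/help/minimal-reproducible-example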
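The formula in the comment above differs from what TfidfVectorizer computes by default, which may explain part of the gap. The sketch below puts the two side by side using the numbers from the comment (1 occurrence, 672 words, 1092 documents, 55 of them containing the term, taking the comment's figures as document counts). The helper names are hypothetical and written only for this comparison. The scikit-learn side assumes the defaults sublinear_tf=False and smooth_idf=True.

```python
import math

def textbook_tfidf(count, doc_len, n_docs, df):
    """Relative term frequency times log2(N / df), as in the comment."""
    return (count / doc_len) * math.log2(n_docs / df)

def sklearn_style_weight(count, n_docs, df):
    """Raw count times the smoothed idf ln((1 + N) / (1 + df)) + 1."""
    return count * (math.log((1 + n_docs) / (1 + df)) + 1)

def l2_normalize(weights):
    """Scale one document's weights to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

expected = textbook_tfidf(count=1, doc_len=672, n_docs=1092, df=55)
unnormalized = sklearn_style_weight(count=1, n_docs=1092, df=55)

print(expected)      # the value the commenter computed by hand
print(unnormalized)  # weight before any norm is applied

# After l2 normalization the result depends on the other vocabulary
# terms found in the same document, not on the document's word count.
print(l2_normalize([unnormalized]))
```

Because the term frequency here is a raw count and the normalization runs over the vocabulary terms of one document, a hand calculation that divides by the 672 words would not be expected to match the vectorizer's output.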