Finding Term Frequency and Inverse Document Frequency Using NLTK (Python 3.5): Answers

【Question Title】: Finding Term Frequency and Inverse Document Frequency Utilizing NLTK (Python 3.5)
【Posted】: 2016-09-30 10:13:40
【Question】:

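I am trying to use NLTK to run a term frequency (TF) and inverse document frequency (IDF) analysis on a batch of files (they happen to be IBM corporate press releases). I know that whether NLTK has TF-IDF functionality has been disputed on SO before, but I found documentation indicating that the module does provide it: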

http://www.nltk.org/_modules/nltk/text.html

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

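I have never seen or used `self` or `__init__` in code before. Below is what I have so far. Any advice on how to modify this code so that it works would be greatly appreciated. What I currently have returns nothing. I don't really understand what "source", "self", "term", and "text" stand for in the NLTK documentation.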

import nltk.corpus
from nltk.text import TextCollection
from nltk.corpus import gutenberg
gutenberg.fileids()

ibm1 = gutenberg.words('ibm-github.txt')
ibm2 = gutenberg.words('ibm-alior.txt')

mytexts = TextCollection([ibm1, ibm2])
term = 'software'

def __init__(self, source):
    if hasattr(source, 'words'):
        source = [source.words(f) for f in source.fileids()]

    self._texts = source
    Text.__init__(self, LazyConcatenation(source))
    self._idf_cache = {}

def tf(self, term, mytexts):
    result = mytexts.count(term) / len(mytexts)
    print(result)

【Comments】:

Tags: python-3.x nltk

【Solution 1】:

from nltk.text import TextCollection
from nltk.book import text1, text2, text3

mytexts = TextCollection([text1, text2, text3])

# Print the IDF of a word
print(mytexts.idf("Moby"))

# Print the TF-IDF of the same word within text1
print(mytexts.tf_idf("Moby", text1))
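
To make the roles of `term` and `text` concrete, here is a minimal plain-Python sketch of the formulas that `TextCollection` appears to apply, as far as I can tell from the linked NLTK source: `tf` is the count of the term divided by the length of one text, and `idf` is the natural log of the number of texts divided by the number of texts containing the term (0 if none contain it). The helper functions and the two short token lists are hypothetical stand-ins for illustration, not NLTK's own code or the real press releases.

```python
import math

def tf(term, text):
    """Fraction of the tokens in one text that equal `term`."""
    return text.count(term) / len(text)

def idf(term, texts):
    """Natural log of (number of texts / number of texts containing `term`)."""
    matches = sum(1 for text in texts if term in text)
    return math.log(len(texts) / matches) if matches else 0.0

def tf_idf(term, text, texts):
    """Product of the two scores for `term` within one text of the collection."""
    return tf(term, text) * idf(term, texts)

# Hypothetical stand-ins for two tokenized press releases.
doc1 = "ibm ships new software for developers".split()
doc2 = "ibm and alior bank announce a partnership".split()
texts = [doc1, doc2]

print(tf("software", doc1))             # share of doc1's tokens that are "software"
print(idf("software", texts))           # appears in only one text, so idf is positive
print(idf("ibm", texts))                # appears in every text, so idf is 0.0
print(tf_idf("software", doc1, texts))
```

In these terms, `term` is the word being scored, `text` is one document as a sequence of tokens, and `self` is the collection holding all the documents. The methods in the question are never called because they are defined at module level rather than invoked on `mytexts`; calling `mytexts.tf(term, ibm1)` and `mytexts.idf(term)` directly, as the solution does with `idf` and `tf_idf`, should be enough, assuming the press releases are loaded from a corpus that actually contains those files.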

【Discussion】: