FreqDist in NLTK not sorting output

【Question title】: FreqDist in NLTK not sorting output
【Posted】: 2014-04-13 12:23:56
【Question description】:

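I'm new to Python and am trying to teach myself language processing. NLTK in Python has a function called FreqDist that gives the frequency of words in a text, but for some reason it isn't working properly.

This is what the tutorial has me write: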

fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
vocabulary1[:50]

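So basically it should give me a list of the 50 most common words in the text. When I run the code, however, the result is the 50 least frequent words, ordered from least to most frequent, rather than the other way around. The output I get is as follows: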

[u'succour', u'four', u'woods', u'hanging', u'woody', u'conjure', u'looking', u'eligible', u'scold', u'unsuitableness', u'meadows', u'stipulate', u'leisurely', u'bringing', u'disturb', u'internally', u'hostess', u'mohrs', u'persisted', u'Does', u'succession', u'tired', u'cordially', u'pulse', u'elegant', u'second', u'sooth', u'shrugging', u'abundantly', u'errors', u'forgetting', u'contributed', u'fingers', u'increasing', u'exclamations', u'hero', u'leaning', u'Truth', u'here', u'china', u'hers', u'natured', u'substance', u'unwillingness...]

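I'm copying the tutorial exactly, but I must be doing something wrong.

Here is the link to the tutorial:

http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words

The example is right under the heading "Figure 1.3: Counting Words Appearing in a Text (a frequency distribution)".

Does anybody know how I might fix this?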

【Question comments】:

Is this your output: ['wonderingly', 'wonderments', 'wondrousness', 'wonst', 'woodcock', 'wooded', 'woodland', 'woodpecker', 'woody', 'wooing', 'woracious', 'wordless', 'worker', 'workers', 'workmen', 'worldly', 'worming', 'worried', 'worryings', 'wounding', 'wounds', 'wrangling', 'wrap', 'wrapall', 'wrapping', 'wreak', 'wreath', 'wrestling', 'wrestlings', 'wretchedly', 'wriggles', 'wring', 'wrinkling', 'writhed', 'wrung', 'yawed', 'yawing', 'yawingly', 'yearly', 'yokes', 'yoking', 'youngest', 'youngish', 'yourselbs', 'zag', 'zay', 'zephyr', 'zig', 'zoned', 'zoology']?
Or is the order reversed? Or are you getting something else entirely?
I got this: [u'succour', u'four', u'woods', u'hanging', u'woody', u'conjure', u'looking', u'eligible', u'scold', u'unsuitable', u'meadows', u'stipulate', u'leisurely', u'bringing', u'disturb', u'internally', u'hostess', u'mohrs', u'persisted', u'Does', u'succession', u'tired', u'cordially', u'pulse', u'elegant', u'second', u'sooth', u'shrugging', u'abundantly', u'errors', u'forgetting', u'contributed', u'fingers', u'increasing', u'exclamations', u'hero', u'leaning', u'Truth', u'here', u'china', u'hers', u'natured', u'substance', u'unwillingness...]
I think yours are the last fifty words in alphabetical order. I might be wrong, but it looks like what I'm getting are the words that occur least often in the text.
You may want to check your text1. How are you defining text1? When I run the code in your post, I do get the output from your most recent comment (the output you are looking for). The output in my first comment is the last fifty words of the same text (as sorted by FreqDist).

Tags: python nlp nltk

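【Solution 1】:

From NLTK's GitHub:

FreqDist in NLTK3 is a wrapper for collections.Counter; Counter provides a most_common() method to return items in order. The FreqDist.keys() method is provided by the standard library; it is not overridden. I think it's good that we're becoming more compatible with stdlib.

The docs on googlecode are very old; they are from 2011. More up-to-date docs can be found on the http://nltk.org website.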

So for NLTK version 3, use fdist1.most_common(50) instead of fdist1.keys()[:50].

The tutorial has also been updated:

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
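
Because the quoted note describes FreqDist in NLTK 3 as built on collections.Counter, the difference between keys() and most_common() can be sketched with Counter alone, without NLTK installed. The token list below is made up for illustration and is not the Moby Dick corpus.

```python
from collections import Counter

# Hypothetical toy tokens standing in for text1.
tokens = ["the", "whale", "the", "sea", "the", "whale"]

counts = Counter(tokens)

# keys() is the plain dict method: it makes no promise about frequency order.
# In Python 3 it also returns a view, so slicing it needs list(...) first.
unordered = list(counts.keys())

# most_common(n) returns (item, count) pairs, highest count first.
top_two = counts.most_common(2)
print(top_two)

# Keep only the words if the counts are not needed.
top_words = [word for word, _ in top_two]
print(top_words)
```

If the same pattern is applied to an NLTK 3 FreqDist, fdist1.most_common(50) takes the place of the old fdist1.keys()[:50] idiom.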

【Comments】:

Thanks Hugo, I was struggling with the same problem that @user3528925 ran into, and your answer helped. My NLTK version is 3 as well.

【Solution 2】:

As an alternative to FreqDist, you can simply use Counter from collections; see also https://stackoverflow.com/questions/22952069/how-to-get-the-rank-of-a-word-from-a-dictionary-with-word-frequencies-python/22953416#22953416:

>>> from collections import Counter
>>> text = """foo foo bar bar foo bar hello bar hello world  hello world hello world hello world  hello world hello hello hello"""
>>> dictionary = Counter(text.split())
>>> dictionary
Counter({'hello': 9, 'world': 5, 'bar': 4, 'foo': 3})
>>> dictionary.most_common()
[('hello', 9), ('world', 5), ('bar', 4), ('foo', 3)]
>>> [i[0] for i in dictionary.most_common()]
['hello', 'world', 'bar', 'foo']
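
The linked question asks for the rank of a word, which follows directly from most_common(). A minimal sketch, reusing the sample text from this answer; the name ranks is only illustrative:

```python
from collections import Counter

text = "foo foo bar bar foo bar hello bar hello world hello world hello world hello world hello world hello hello hello"
counts = Counter(text.split())

# Rank 1 is the most frequent word. Tied words would receive consecutive
# ranks rather than a shared rank; this sample text has no ties.
ranks = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
print(ranks["world"])
```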

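【Comments】:

【Solution 3】:

This answer is outdated. Please use this answer instead.

To troubleshoot this, I'd recommend taking the following steps:

1. Check which version of nltk you are using: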

>>> import nltk
>>> print nltk.__version__
2.0.4  # preferably 2.0 or higher

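Older versions of nltk do not have a sortable FreqDist.keys method.

2. Verify that you have not inadvertently modified text1 or vocabulary1:

Open a new shell and start the process over from the beginning: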

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> from nltk import FreqDist
>>> fdist1 = FreqDist(text1)
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

Note that vocabulary1 should not contain the string u'succour' (the first unicode string in the original post's output):

>>> vocabulary1.count(u'succour')  # vocabulary1 does **not** contain the string u'succour'
0

3. If you're still having problems, check your source code and text list to make sure they match what you see below:

>>> import inspect
>>> print inspect.getsource(FreqDist.keys)  # make sure your source code matches the source code below
    def keys(self):
        """
        Return the samples sorted in decreasing order of frequency.

        :rtype: list(any)
        """
        self._sort_keys_by_value()
        return map(itemgetter(0), self._item_cache)

>>> print inspect.getsource(FreqDist._sort_keys_by_value)  # and matches this source code
    def _sort_keys_by_value(self):
        if not self._item_cache:
            self._item_cache = sorted(dict.items(self), key=lambda x:(-x[1], x[0]))  # <= check this line especially

>>> text1[:40]  # does the first part of your text list match this one?
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',', 'heart', ',', 'body', ',', 'and', 'brain', ';', 'I', 'see', 'him']

>>> text1[-40:]  # and what about the end of your text list?
['second', 'day', ',', 'a', 'sail', 'drew', 'near', ',', 'nearer', ',', 'and', 'picked', 'me', 'up', 'at', 'last', '.', 'It', 'was', 'the', 'devious', '-', 'cruising', 'Rachel', ',', 'that', 'in', 'her', 'retracing', 'search', 'after', 'her', 'missing', 'children', ',', 'only', 'found', 'another', 'orphan', '.']

If your source code or text list doesn't match the above exactly, consider reinstalling nltk with the latest stable version.
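
The sort key in the _sort_keys_by_value source quoted above orders samples by descending count and breaks ties alphabetically. A minimal sketch of that same ordering on a plain dict, with made-up counts (this is not NLTK code):

```python
# Hypothetical counts; 'bar' and 'baz' tie to show the alphabetical tie-break.
counts = {"foo": 3, "baz": 4, "bar": 4, "hello": 9}

# Same key as the quoted source: negative count first, then the sample itself.
items = sorted(counts.items(), key=lambda x: (-x[1], x[0]))

# Drop the counts to get the samples in decreasing order of frequency.
keys_by_frequency = [sample for sample, _ in items]
print(keys_by_frequency)
```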

【Comments】:

I did what you said and got a different output, but it's still not right :( [u'funereal', u'unscientific', u'divinely', u'foul', u'four', u'gag', u'prefix', u'woods', u'clotted', u'Duck', u'hanging', u'plaudits', u'woody', u'Until', u'marching', u'disobeying', u'canes', u'granting', u'advantage', u'Westers', u'insertion', u'DRYDEN', u'formless', u'Untried', u'superficially', u'Western', u'portentous', u'meadows', u'sinking', u'Ding', u'Spurn', u'treasuries', u'churned', u'oceans', u'invasion', u'powders', u'tinkerings', u'tantalizing', u'yellow'...]
@user3528925 Sorry to hear that. I've added four further steps to the answer above that you might take to troubleshoot this. Let me know how it goes.
The beginning and end of the text are exactly the same as what you showed. I tried running the first part of what you wrote, but it gave me a syntax error.
@user3528925 Make sure you copy only the first line of each inspect.getsource() call. So this would be the whole script: line 1: import inspect, line 2: from nltk import FreqDist, line 3: print inspect.getsource(FreqDist.keys).
@user3528925 Out of curiosity, what output do you get when you type import nltk and then (on the next line) print nltk.__version__?

【Solution 4】:

import nltk
fdist1 = nltk.FreqDist(text)

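fdist1 contains 'keys' for the words and 'values' for the words' frequency counts.

The variable fdist1 above is not sorted, so it won't return the top 50 results as requested. Use the following code to sort it first: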

sorted_fdist1 = sorted(fdist1 , key = fdist1.__getitem__, reverse = True)
sorted_fdist1[0:50]

This will print out the 50 most common words.
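
The same sorted() call can be tried without NLTK. In this sketch a collections.Counter stands in for nltk.FreqDist and the tokens are made up:

```python
from collections import Counter

# Hypothetical tokens; Counter stands in for nltk.FreqDist here.
fdist = Counter("a b a c a b d".split())

# Iterating a Counter yields its keys, so sorted() returns words only,
# ordered by their counts from highest to lowest.
sorted_words = sorted(fdist, key=fdist.__getitem__, reverse=True)
print(sorted_words[:2])
```

Unlike most_common(), this returns the words without their counts.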

【Comments】:

What is freq_dist?
@NadavB That was a typo. Try this: sorted_fdist1 = sorted(fdist1 , key = fdist1.__getitem__, reverse = True). I've just fixed it in the answer!