Python word counter with sorted frequency: answers

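【Question title】: Python word counter w/ sorted frequency
【Posted】: 2015-12-05 22:05:16
【Question】:

I'm trying to read a text file and then print out all of its words, with the most frequently used words at the top and the frequency decreasing as you go down the list. I'm using Python 3.3.2.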

def wordCounter(thing):
# Open a file
    file = open(thing, "r+")
    newWords={}
    for words in file.read().split():
        if words not in newWords:
            newWords[words] = 1
        else:
            newWords[words] += 1

    for k,v in frequency.items():
        print (k, v)
    file.close()

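Right now it does print everything out the /way/ I want it to, but some words that are used more often end up lower in the list than words that are used less. I tried using newWords.sort(), but it says: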

"AttributeError: 'dict' object has no attribute 'sort'"

So I'm at a loss, as my knowledge is very limited.

【Comments】:

What does the input file look like?
Dictionaries don't have sort(), but you can pass them to sorted().

Tags: python counter frequency

【Solution 1】:

Don't reinvent the wheel: collections.Counter does the counting, and its .most_common method does the sorting, giving you the words from most to least frequent:

from collections import Counter
def wordCounter(thing):
   with open(thing) as f:
       cn = Counter(w for line in f for w in line.split())
       return cn.most_common()

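You also don't need to read the whole file into memory; you can iterate over it line by line and split each line. You also have to take punctuation into account, which you can remove from the ends of each word with str.strip: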

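from collections import Counter
from string import punctuation

def wordCounter(thing):
    with open(thing) as f:
        # Strip leading/trailing punctuation from every whitespace-separated token
        cn = Counter(w.strip(punctuation) for line in f for w in line.split())
        return cn.most_common()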
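
To see what this approach returns, here is a minimal sketch that runs the same Counter expression on an in-memory sample instead of a file. The sample text and the name count_words are made up for illustration, and io.StringIO merely stands in for the opened file.

```python
from collections import Counter
from io import StringIO
from string import punctuation


def count_words(lines):
    """Count words in an iterable of lines, stripping surrounding punctuation."""
    counts = Counter(w.strip(punctuation) for line in lines for w in line.split())
    # A token made only of punctuation (e.g. "--") strips down to "", so drop it.
    counts.pop("", None)
    return counts.most_common()


# Hypothetical sample text standing in for the contents of a file.
sample = StringIO("the cat sat.\nThe cat ran, the dog sat!\n")
result = count_words(sample)
for word, freq in result:
    print(word, freq)
```

Note that str.strip only removes punctuation at the two ends of a token, so a word like "don't" keeps its apostrophe, and that the count is case-sensitive: "The" and "the" are tallied separately unless you also call .lower() on each word.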

【Comments】:

【Solution 2】:

This prints the most frequently used words first:

from operator import itemgetter

for k, v in sorted(frequency.items(), key=itemgetter(1), reverse=True):
    print(k, v)

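key is a function that is applied to each item to produce the value it is sorted by. In our case, itemgetter(1) retrieves the value of each (word, count) pair, i.e. the frequency, as the sort criterion.

An alternative without imports: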

for k, v in sorted(frequency.items(), key=lambda x: x[1], reverse=True):
    print(k, v)
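
As a quick check, the sketch below sorts a small made-up dictionary both ways and shows that the two key functions give the same ordering. The contents of frequency here are hypothetical; in the question the dictionary is built from the file.

```python
from operator import itemgetter

# Hypothetical counts standing in for the dictionary built in the question.
frequency = {"dog": 1, "the": 3, "cat": 2}

# Both key functions pick element 1 (the count) out of each (word, count) pair.
by_itemgetter = sorted(frequency.items(), key=itemgetter(1), reverse=True)
by_lambda = sorted(frequency.items(), key=lambda pair: pair[1], reverse=True)

print(by_itemgetter)  # [('the', 3), ('cat', 2), ('dog', 1)]
print(by_itemgetter == by_lambda)  # True
```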

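【Comments】:

You don't need any imports to do something like this. Maybe add an explanation as well?
@TigerhawkT3 itemgetter is a bit nicer than lambda. I think it's worth the import. I'm still writing up some explanation.
You don't need lambda either.
I didn't use the import; I changed frequency to newWords (frequency was left over from trying out a sort that didn't work), and it works fine for me. Thanks!

【Solution 3】:

You can try this approach: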

from collections import Counter

with open('file_name.txt') as f:
    c = Counter(f.read().split())
    # print is a function in Python 3 (the question uses 3.3.2)
    print(c.most_common())
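
If only the top few words are of interest, most_common also accepts a count. The following is a small sketch on a made-up string rather than a file; the text and variable names are only for illustration.

```python
from collections import Counter

# Hypothetical text standing in for f.read().
text = "a b a c b a"
c = Counter(text.split())

# most_common(n) returns only the n highest counts, most frequent first.
top_two = c.most_common(2)
print(top_two)  # [('a', 3), ('b', 2)]
```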

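【Comments】:

【Solution 4】:

Dictionaries don't have a sort() method. However, you can pass a dictionary to the built-in function sorted(), which produces a list of the dictionary's keys. Use a sort key with a function that returns the value for a dictionary key, i.e. the get() method.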

for key in sorted(newWords, key=newWords.get):
    print(key, newWords[key])

Also, it looks like you've been doing some refactoring, because frequency isn't defined anywhere in your code.
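
One thing to keep in mind is that sorted() orders from smallest to largest by default, so the loop above lists the least frequent words first. A minimal sketch with made-up counts, adding reverse=True to put the most frequent words at the top as the question asks:

```python
# Hypothetical counts; in the question this dictionary is built from the file.
newWords = {"dog": 1, "the": 3, "cat": 2}

# sorted() iterates over the keys; newWords.get maps each key to its count.
ascending = sorted(newWords, key=newWords.get)
descending = sorted(newWords, key=newWords.get, reverse=True)

for key in descending:
    print(key, newWords[key])
```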

【Comments】:

【Solution 5】:

If you want to sort without any imports:

word_count = sorted(new_words.items(), key=lambda x: x[1], reverse=True)

Note: pulling the words out with a regular expression is a better approach:

import re
from collections import defaultdict

word_count = defaultdict(int)
# A word is a letter followed by any number of letters or digits
pattern = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
with open("file.txt") as file:
    for line in file:
        for word in pattern.findall(line):
            word_count[word] += 1
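
The two snippets in this answer can be combined. Below is a rough sketch that applies the same pattern to a couple of made-up lines and then sorts the result by count; the function name count_pattern_words and the sample lines are assumptions for illustration, not part of the original answer.

```python
import re
from collections import defaultdict

# Same pattern as above: a letter followed by any number of letters or digits.
pattern = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def count_pattern_words(lines):
    """Count regex-matched words and return (word, count) pairs, highest first."""
    word_count = defaultdict(int)
    for line in lines:
        for word in pattern.findall(line):
            word_count[word] += 1
    return sorted(word_count.items(), key=lambda x: x[1], reverse=True)


# Hypothetical lines standing in for the lines of a file.
sample_lines = ["stop, stop! go2 now", "stop now?"]
ranked = count_pattern_words(sample_lines)
print(ranked)  # [('stop', 3), ('now', 2), ('go2', 1)]
```

With this pattern punctuation never becomes part of a word, but an apostrophe also splits a word, so "don't" is counted as "don" and "t".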

【Comments】: