Best way to turn word list into frequency dict: answers

【Question title】: Best way to turn word list into frequency dict
【Posted】: 2009-04-06 18:55:11
【Question description】:

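What is the best way to convert a list/tuple into a dict where the keys are the distinct values of the list and the values are the frequencies of those distinct values?

In other words: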

['a', 'b', 'b', 'a', 'b', 'c']
--> 
{'a': 2, 'b': 3, 'c': 1}

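(I have had to do the above so many times; is there anything in the standard lib that does it for you?)

EDIT:

Jacob Gabrielson points out that there is something coming in the standard lib for the 2.7/3.1 branch.

【Question comments】:

Maybe define what you mean by "best"? Most efficient? Least code? Easiest to understand?

Tags: python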

【Solution 1】:

The way I find easiest to understand (though it may not be the most efficient) is:

{i:words.count(i) for i in set(words)}

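【Comments】:

This is why I love Python!
@S.Lott: dict comprehensions are also available in 2.7 (backported from 3.x), not only in 3.0.
Very elegant... but close to quadratic cost in the worst (yet real-life) case, because words.count(i) rescans the whole list for every distinct word. Use with caution.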
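
To illustrate the cost concern raised in the last comment: `words.count(w)` walks the entire list once per distinct word, so a list of n items with k distinct values needs roughly n * k comparisons. The following is only a minimal sketch; the function names `freq_with_count` and `freq_single_pass` are made up for illustration and do not come from the thread.

```python
def freq_with_count(words):
    # One full scan of `words` per distinct value:
    # about len(words) * len(set(words)) comparisons in total.
    return {w: words.count(w) for w in set(words)}


def freq_single_pass(words):
    # One scan in total: each item costs a single dict lookup and update.
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq


words = ['a', 'b', 'b', 'a', 'b', 'c']
print(freq_with_count(words) == freq_single_pass(words))  # True
```

Both functions return the same dict; they differ only in how often the input is traversed, which starts to matter when most words are distinct.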

【Solution 2】:

Kind of:

from collections import defaultdict
fq = defaultdict(int)
for w in words:
    fq[w] += 1

That usually works nicely.

【Comments】:

【Solution 3】:

Note that as of Python 2.7/3.1 this functionality is built into the collections module; see this bug for more information. Here is the example from the release notes:

>>> from collections import Counter
>>> c=Counter()
>>> for letter in 'here is a sample of english text':
...   c[letter] += 1
...
>>> c
Counter({' ': 6, 'e': 5, 's': 3, 'a': 2, 'i': 2, 'h': 2,
'l': 2, 't': 2, 'g': 1, 'f': 1, 'm': 1, 'o': 1, 'n': 1,
'p': 1, 'r': 1, 'x': 1})
>>> c['e']
5
>>> c['z']
0

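【Comments】:

It looks even simpler than that: apparently you can pass the string to the Counter constructor and it will do the counting for you.
You can simply do Counter(word_list).

【Solution 4】:

The Counter answer has already been given, but we can do even better (simpler)!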

from collections import Counter
my_list = ['a', 'b', 'b', 'a', 'b', 'c']
Counter(my_list)  # returns a Counter, dict-like object
# Counter({'b': 3, 'a': 2, 'c': 1})
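
Since the question asks for a dict, it may help to see how the "dict-like" Counter behaves in practice. This is a small sketch using only documented `collections.Counter` behaviour; the variable names `counts` and `plain` are illustrative.

```python
from collections import Counter

my_list = ['a', 'b', 'b', 'a', 'b', 'c']
counts = Counter(my_list)

# Counter is a dict subclass, so it converts to a plain dict if one is required.
plain = dict(counts)
print(plain == {'a': 2, 'b': 3, 'c': 1})  # True

# most_common(n) returns the n highest counts as (value, count) pairs.
print(counts.most_common(1))  # [('b', 3)]

# A missing key reports a count of 0 instead of raising KeyError.
print(counts['z'])  # 0
```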

【Comments】:

【Solution 5】:

This is an abomination, but:

from itertools import groupby
dict((k, len(list(xs))) for k, xs in groupby(sorted(items)))

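I can't think of a reason why anyone would choose this method over S.Lott's, but if someone is going to point it out, it might as well be me. :)

【Comments】:

I have to say I had just thought of this and tested its performance (I'm looking at lists with millions of objects), assuming it had to be faster than repeatedly getting/setting a hash map... but it turns out that in my tests it takes 4x the CPU time when it has to sort the list, or 2x when the list is already sorted. Interesting. It is very clever, though.
If you're dealing with millions of objects, you'd be better off using an external sort anyway (or offloading the sort to the data engine the input comes from, if possible). The old shell chestnut sort words.txt | uniq -c is hard to beat.

【Solution 6】:

I decided to go ahead and test the suggested versions. I found the collections.Counter suggested by Jacob Gabrielson to be the fastest, followed by the defaultdict version by SLott.

Here is my code: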
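
For readers who want to stay in Python, here is a rough sketch of a different technique from the external sort suggested above: it streams the input one line at a time and keeps only the counts in memory, so it helps when the number of distinct words is manageable even if the input is large. The function name `count_words` is hypothetical, and `words.txt` is simply the file name used in the comment above.

```python
from collections import Counter


def count_words(lines):
    # `lines` can be any iterable of strings, for example an open file object,
    # so the input is consumed one line at a time.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Hypothetical usage with a file:
# with open('words.txt') as f:
#     counts = count_words(f)

sample = ['a b b', 'a b c']
print(count_words(sample) == Counter({'a': 2, 'b': 3, 'c': 1}))  # True
```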

from collections import defaultdict
from collections import Counter

import random

# using defaultdict
def counter_default_dict(items):
    count = defaultdict(int)
    for i in items:
        count[i] += 1
    return count

# using a normal dict
def counter_dict(items):
    count = {}
    for i in items:
        count.update({i: count.get(i, 0) + 1})
    return count

# using list.count and a dict comprehension
def counter_count(items):
    count = {i: items.count(i) for i in set(items)}
    return count

# using collections.Counter
def counter_counter(items):
    count = Counter(items)
    return count

# sample input: 300 sorted random integers in the range 0..250
data = sorted([random.randint(0, 250) for i in range(300)])


if __name__ == '__main__':
    from timeit import timeit
    print("collections.Defaultdict ", timeit("counter_default_dict(data)", setup="from __main__ import counter_default_dict,data", number=1000))
    print("Dict", timeit("counter_dict(data)", setup="from __main__ import counter_dict,data", number=1000))
    print("list.count ", timeit("counter_count(data)", setup="from __main__ import counter_count,data", number=1000))
    print("collections.Counter.count ", timeit("counter_counter(data)", setup="from __main__ import counter_counter,data", number=1000))

My results:

collections.Defaultdict  0.06787874956330614
Dict 0.15979115872995675
list.count  1.199258431219126
collections.Counter.count  0.025896202538920665

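Please let me know how I can improve the analysis.

【Comments】:

【Solution 7】:

I have to share an interesting but somewhat ridiculous way of doing this that I just came up with: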
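
One possible way to tighten the analysis, offered as a sketch and not as a definitive benchmark: time each function over several rounds with `timeit.repeat` and keep the minimum, fix the random seed, and try more than one input size. The helper `best_time`, the function names, and the chosen sizes are all assumptions made for illustration; no timings are claimed here.

```python
import random
from collections import Counter, defaultdict
from timeit import repeat


def with_defaultdict(items):
    counts = defaultdict(int)
    for item in items:
        counts[item] += 1
    return counts


def with_counter(items):
    return Counter(items)


def best_time(func, data, number=1000, rounds=5):
    # timeit accepts a callable; taking the minimum of several rounds
    # reduces noise caused by other processes on the machine.
    return min(repeat(lambda: func(data), number=number, repeat=rounds))


if __name__ == '__main__':
    random.seed(0)  # fixed seed so that separate runs see the same input
    for size in (300, 3000):
        data = [random.randint(0, 250) for _ in range(size)]
        for func in (with_defaultdict, with_counter):
            print(size, func.__name__, best_time(func, data, number=50))
```

Passing a callable avoids the `from __main__ import ...` setup strings, at the price of one extra function call per timed run.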

>>> class myfreq(dict):
...     def __init__(self, arr):
...         for k in arr:
...             self[k] = 1
...     def __setitem__(self, k, v):
...         dict.__setitem__(self, k, self.get(k, 0) + v)
... 
>>> myfreq(['a', 'b', 'b', 'a', 'b', 'c'])
{'a': 2, 'c': 1, 'b': 3}

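【Comments】:

(self.get(k) or 0) would be better written as self.get(k, 0)

【Solution 8】:

I think using the collections library is the easiest way to get this. But if you want a frequency dictionary without using it, here is another way: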

l = [1, 4, 2, 1, 2, 6, 8, 2, 2]
d = {}
for i in l:
    # membership test on the dict itself; no need to call keys()
    if i in d:
        d[i] = 1 + d[i]
    else:
        d[i] = 1
print(d)

Output:

{1: 2, 4: 1, 2: 4, 6: 1, 8: 1}
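
The same loop can be shortened with `dict.get`, which takes a default value for missing keys. This is just a variant sketch of the code above, not a different algorithm.

```python
l = [1, 4, 2, 1, 2, 6, 8, 2, 2]
d = {}
for i in l:
    # get() returns 0 for a value that has not been seen yet,
    # so the if/else branch is no longer needed.
    d[i] = d.get(i, 0) + 1
print(d == {1: 2, 4: 1, 2: 4, 6: 1, 8: 1})  # True
```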

【Comments】: