Python: Most efficient method to find the most common string (answers)

【Question title】: Python: Most efficient method to find the most common string
【Posted】: 2018-05-23 19:58:53
【Question】:

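I want to find the 20 most common names in a country, together with their frequencies.

Suppose I have lists of the names of all residents in 100 cities. Each list may contain many names. Say we are talking about 100 lists, each containing 1000 strings.

What is the most efficient way to get the 20 most common names, and their frequencies, across the whole country?

This is the direction I started in, assuming each city is stored as a text file in the same directory:

1. Use pandas and the collections module for this.
2. Iterate through each city.txt and read it into a string. Then turn it into a collection using Counter, and then into a DataFrame (using to_dict).
3. Union each DataFrame with the previous one.
4. Then group the DataFrame and count(*).

However, I think this approach might not work, because the DataFrame could become too large.

I would like to hear any suggestions on this. Thank you.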

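【Question discussion】:

You can use the Counter class from collections.
Pandas provides value_counts() for this.
How much data are we talking about here?
I am using it for each list (see 2; I will edit it to clarify this). But what do you do when you have 100 lists, each containing 100 strings?
Please show us the code you have already tried, and then we can help you. This site is not meant to provide answers to your school homework.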
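
The two suggestions in the comments can be combined for the many-lists case. The following is a minimal sketch, assuming each city's names are already in memory as a Python list; the sample data and variable names are made up for illustration. It counts each list with value_counts() and then sums the per-city counts, so the intermediate data is one row per distinct name per city rather than one row per resident.

```python
import pandas as pd

# Hypothetical sample data: one list of names per city.
cities = [
    ["Anna", "Ben", "Anna", "Cara"],
    ["Ben", "Anna", "Dan"],
    ["Anna", "Cara", "Ben"],
]

# Count each city on its own; each result is a small Series indexed by name.
per_city = [pd.Series(names).value_counts() for names in cities]

# Stack the per-city counts, add up the rows that share a name,
# then keep the names with the highest totals.
totals = pd.concat(per_city).groupby(level=0).sum()
top = totals.nlargest(2)
print(top)
```

For the question as asked, nlargest(20) would replace nlargest(2).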

Tags: python performance list pandas

【Solution 1】:

Here is some sample code:

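import os
from collections import Counter

# Every .txt file in the current directory holds the names for one city.
cities = [name for name in os.listdir(".") if name.endswith(".txt")]

counts = Counter()

for path in cities:
    with open(path) as f:
        # Adjust the line below so that the names end up in a list.
        names = f.read().split(",")
        counts.update(names)  # update() accepts any iterable of items

# The question asks for the 20 most common names.
out = counts.most_common(20)
print(out)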

【Discussion】:

Is this the fastest way? Is there anything faster?
@Smithnson Yes, if you are using Python, this is probably the fastest way.
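
A variation on the same idea, not taken from the answer above, is to feed the Counter line by line instead of reading each file into one string. This is a sketch under the assumption that every file holds one name per line; the function name top_names and its parameters are hypothetical. No speed claim is made here. The point is only that memory use then depends on the number of distinct names, not on the size of any single file.

```python
from collections import Counter


def top_names(paths, n=20):
    """Return the n most common names across all files in `paths`."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # Assumption: one name per line; blank lines are skipped.
            counts.update(line.strip() for line in f if line.strip())
    return counts.most_common(n)
```

It could be called, for example, as top_names(pathlib.Path(".").glob("*.txt")) to cover every city file in the current directory.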

【Solution 2】:

You can also use the NLTK library; I used the code below for a similar purpose.

from nltk import FreqDist

# `text` must be an iterable of tokens (e.g. a list of names), not one long string.
fd = FreqDist(text)
top_20 = fd.most_common(20)  # done: the 20 most frequent tokens with their counts

【Discussion】:
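
One thing to watch with Solution 2 is what `text` holds. FreqDist is a subclass of collections.Counter, so the sketch below uses Counter to show the behaviour without requiring NLTK. The sample string is hypothetical, and the comma separator is borrowed from Solution 1 as an assumption about the file format.

```python
from collections import Counter

raw = "Anna,Ben,Anna,Cara"  # hypothetical file contents

# Passing the raw string counts single characters, not names.
by_char = Counter(raw)

# Splitting first gives one token per name, which is what we want to count.
by_name = Counter(raw.split(","))

print(by_char.most_common(1))
print(by_name.most_common(1))
```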