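Count frequency of item in a list of tuples: answers

【Question title】: Count frequency of item in a list of tuples
【Posted】: 2018-05-30 08:01:54
【Question】:

I have a list of tuples like the one shown below. I need to find the items that occur more than once. The code I have written so far is very slow, even with only about 10K tuples. In the sample below the string 'example' appears twice, so that is the kind of string I need to get. What is the best way to count the strings while iterating over the generator?

List: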

 b_data=[('example',123),('example-one',456),('example',987),.....]

My code so far:

blockslst=[]
for line in b_data:
    blockslst.append(line[0])

blocklstgtone=[]
for item in blockslst:
    if(blockslst.count(item)>1):
        blocklstgtone.append(item)

【Comments】:

By the way, that is not a generator expression, it is a list.

Tags: python python-3.x list tuples generator

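【Solution 1】:

You have the right idea in extracting the first item from each tuple. You can make your code more concise with a list/generator comprehension, as shown below.

From there, the most idiomatic way to find the frequency counts of elements is a collections.Counter object.

1. Extract the first element of each tuple in the list (using a comprehension)
2. Pass this to Counter
3. Query the count of example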

from collections import Counter

counts = Counter(x[0] for x in b_data)
print(counts['example'])

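Of course, you can use list.count if there is only one item whose frequency you want, but in the general case a Counter is the way to go.

The advantage of a Counter is that it computes the frequency counts of all elements (not just example) in linear (O(N)) time. Say you also want the count of another element, such as foo. That would be done with: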

print(counts['foo'])

If 'foo' does not exist in the list, this returns 0.
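
As a minimal sketch of that lookup behaviour, the snippet below builds a Counter from the three tuples visible in the question (the elided tail of b_data is left out) and queries one key that occurs and one that does not. The variable names present, missing and foo_was_added are illustrative only.

```python
from collections import Counter

# Sample data taken from the question; the elided tail is left out.
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

counts = Counter(name for name, _ in b_data)

present = counts['example']       # key that was counted
missing = counts['foo']           # key that never occurred: 0, no KeyError
foo_was_added = 'foo' in counts   # looking up a missing key does not insert it

print(present, missing, foo_was_added)  # 2 0 False
```

Unlike a plain dict, which raises KeyError for an unknown key, a Counter reports 0 and leaves its contents unchanged.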

If you want to find the most common elements, call counts.most_common:

print(counts.most_common(n))

where n is the number of elements you want to display. To see everything, don't pass n.
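
A short sketch of both calls, using a made-up list of names (not the asker's data):

```python
from collections import Counter

# Hypothetical sample: 'example' occurs three times, the others once.
names = ['example', 'example-one', 'example', 'example-two', 'example']
counts = Counter(names)

top_one = counts.most_common(1)    # [('example', 3)]
everything = counts.most_common()  # every (item, count) pair, highest count first

print(top_one)
print(everything)
```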

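To retrieve the counts of the most common elements, an efficient approach is to query most_common and then use itertools to extract all elements whose count is greater than 1.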

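from collections import Counter
from itertools import takewhile

l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
c = Counter(l)

# most_common() is sorted by descending count, so stop at the first count <= 1
print(list(takewhile(lambda x: x[-1] > 1, c.most_common())))
# [(1, 5), (3, 4), (2, 3), (7, 2)]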

(OP edit) Alternatively, use a list comprehension to get the list of items with count > 1:

[item[0] for item in counts.most_common() if item[-1] > 1]

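Keep in mind that this is not as efficient as the itertools.takewhile solution. For example, if you have one item with count > 1 and a million items with count equal to 1, the comprehension walks through all million entries even though it does not have to (because most_common returns the counts in descending order). That is not the case with takewhile, which stops iterating as soon as the condition count > 1 becomes false.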
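
The difference can be made visible with a rough sketch that counts how many entries each approach inspects. The data and the helper repeated() are hypothetical and exist only to instrument the predicate.

```python
from collections import Counter
from itertools import takewhile

# Hypothetical data: one repeated key plus 1000 keys that occur once.
data = ['dup', 'dup'] + [f'unique-{i}' for i in range(1000)]
ranked = Counter(data).most_common()

inspected = {'takewhile': 0, 'comprehension': 0}

def repeated(entry, label):
    """Return True if the count is > 1, recording that the entry was examined."""
    inspected[label] += 1
    return entry[1] > 1

via_takewhile = [e[0] for e in takewhile(lambda e: repeated(e, 'takewhile'), ranked)]
via_comprehension = [e[0] for e in ranked if repeated(e, 'comprehension')]

print(via_takewhile, via_comprehension)
print(inspected)  # {'takewhile': 2, 'comprehension': 1001}
```

Note that most_common() still sorts every entry before either filter runs, so the saving applies only to the filtering step.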

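【Comments】:

With most_common, is there a way to get all the strings that have count > 1?
@min2bro Yes, why not. Query most_common and then loop over it. Let me write a small answer.
Thanks for Counter. I didn't know that one and recreated it just like Ayodhyankit Paul did.
@coldspeed, I added a list comprehension to get only the list of strings with count greater than 1
@cs95 I have a question about list(takewhile(lambda x: x[-1] > 1, c.most_common()))... How do you reset the counter inside a for loop? For the life of me I can't reset it. I have tried c.clear(), c.update('a'), c = Counter().. but nothing resets the counter. It just keeps adding and adding...

【Solution 2】:

First approach:

How about doing it without a loop?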

print(list(map(lambda x:x[0],b_data)).count('example'))

Output (assuming b_data holds only the three tuples visible in the question):

2

Second approach:

You can do the counting with a plain dict, without importing any external module or making things so complicated:

b_data = [('example', 123), ('example-one', 456), ('example', 987)]

dict_1={}
for i in b_data:
    if i[0] not in dict_1:
        dict_1[i[0]]=1
    else:
        dict_1[i[0]]+=1

print(dict_1)



print(list(filter(lambda y: y is not None,
                  map(lambda x: (x, dict_1.get(x)) if dict_1.get(x) > 1 else None,
                      dict_1.keys()))))

Output:

[('example', 2)]

Test case:

b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]

Output:

[('example-two', 4), ('example-one', 3), ('example', 2)]
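
For comparison, here is a minimal sketch of the same manual tally written with dict.get and an explicit sort. It is a swapped-in variant, not the answerer's code. The names tally and repeated are illustrative, and the sort is added on the assumption that the result should be ordered by descending count, since a plain dict otherwise yields keys in insertion order (Python 3.7+).

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-one', 456), ('example-one', 456), ('example-two', 456),
          ('example-two', 456), ('example-two', 456), ('example-two', 456)]

# Tally the first element of every tuple; get() supplies 0 for unseen keys.
tally = {}
for name, _ in b_data:
    tally[name] = tally.get(name, 0) + 1

# Keep only names seen more than once, highest count first.
repeated = sorted(((n, c) for n, c in tally.items() if c > 1),
                  key=lambda pair: pair[1], reverse=True)
print(repeated)  # [('example-two', 4), ('example-one', 3), ('example', 2)]
```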

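【Comments】:

The only problem here is that I need to iterate over the dict_1 dictionary again to find the strings whose count is greater than 1
@min2bro I have updated the solution, check it now; you don't even need a loop there or any external module.
I am looking for all strings with count > 1, so the above solution only works for one string's count at a time
Your code is just a re-implementation of Counter. Why reinvent the wheel?
@min2bro If you want all the strings, then use my second solution. And yes, I fixed that: now you don't have to iterate over the dict_1 dictionary to find the strings with count greater than 1.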

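【Solution 3】:

In the time it took me to write this, ayodhyankit-paul posted the same thing. I am leaving it in nonetheless for the test-data generator code and the timing:

Creating 100001 items took about 5 seconds, counting took about 0.3 seconds, and filtering on the counts was too fast to measure (using datetime.now(), I didn't bother with perf_counter). All in all it took less than 5.1 s from start to finish for about 10 times the data you are operating on.

I think this is similar to what Counter in COLDSPEED's answer does:

for each item in the list of tuples:

    if item[0] is not in the dict, put it into the dict with a count of 1
    otherwise, increment its count in the dict by 1

Code: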

from collections import Counter
import random
from datetime import datetime # good enough for a loong running op


dt_datagen = datetime.now()
numberOfKeys = 100000 


# basis for testdata
textData = ["example", "pose", "text","someone"]
numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant

# create random testdata from above lists
tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)] 

tData.append(("aaa",99))

dt_dictioning = datetime.now()

# create a dict
countEm = {}

# put all your data into dict, counting them
for p in tData:
    if p[0] in countEm:
        countEm[p[0]] += 1
    else:
        countEm[p[0]] = 1

dt_filtering = datetime.now()
#comparison result-wise (commented out)        
#counts = Counter(x[0] for x in tData)
#for c in sorted(counts):
#    print(c, " = ", counts[c])
#print()  
# output dict if count > 1
subList = [x for x in countEm if countEm[x] > 1] # without "aaa"

dt_printing = datetime.now()

for c in sorted(subList):
    if (countEm[c] > 1):
        print(c, " = ", countEm[c])

dt_end = datetime.now()

print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
print( "Printing all the items left took    \t", (dt_end-dt_printing).total_seconds(), " seconds")

print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )

Output:

# reformatted for brevity
example0  =  2520       example1  =  2535       example2  =  2415
example3  =  2511       example4  =  2511       example5  =  2444
example6  =  2517       example7  =  2467       example8  =  2482
example9  =  2501

pose0  =  2528          pose1  =  2449          pose2  =  2520      
pose3  =  2503          pose4  =  2531          pose5  =  2546          
pose6  =  2511          pose7  =  2452          pose8  =  2538          
pose9  =  2554

someone0  =  2498       someone1  =  2521       someone2  =  2527
someone3  =  2456       someone4  =  2399       someone5  =  2487
someone6  =  2463       someone7  =  2589       someone8  =  2404
someone9  =  2543

text0  =  2454          text1  =  2495          text2  =  2538
text3  =  2530          text4  =  2559          text5  =  2523      
text6  =  2509          text7  =  2492          text8  =  2576      
text9  =  2402


Creating  100001  testdataitems took:    4.728604  seconds
Putting them into dictionary took        0.273245  seconds
Filtering donw to those > 1 hits took    0.0  seconds
Printing all the items left took         0.031234  seconds

Total time:      5.033083  seconds
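
Since the answer mentions perf_counter, here is a rough sketch of how the counting step alone could be timed with time.perf_counter, comparing a manual dict tally with Counter. The fixed seed, the variable names and the data shape are assumptions for illustration. No timing figures are claimed here, because they vary by machine.

```python
import random
from collections import Counter
from time import perf_counter

random.seed(0)  # assumption: fixed seed so the generated data is repeatable
words = ["example", "pose", "text", "someone"]
t_data = [(random.choice(words) + str(i % 10), i) for i in range(100_000)]

# Time the manual dict tally.
start = perf_counter()
manual = {}
for name, _ in t_data:
    manual[name] = manual.get(name, 0) + 1
manual_seconds = perf_counter() - start

# Time the Counter-based tally over the same data.
start = perf_counter()
counted = Counter(name for name, _ in t_data)
counter_seconds = perf_counter() - start

print(f"manual dict: {manual_seconds:.4f} s, Counter: {counter_seconds:.4f} s")
print("same result:", manual == dict(counted))
```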

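【Comments】:

@COOLDSPEED mentioned in the other answer that this is roughly what Counter does internally, so don't use mine, use Counter ;) I guess it works even smarter.
I can still appreciate a good answer. Upvoted, cheers.

【Solution 4】:

Let me give you an example to help you understand. Although this example is quite different from yours, I have found it very helpful when solving this kind of problem.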

from collections import Counter

a = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
# 
# 1. Lowercase everything
# 2. Split it into words.
# 3. Count the results.

dictionary = Counter(word for i, j in a for word in j.lower().split())

print(dictionary)

# print every word whose count is > 1
for word, count in dictionary.most_common():
    if count > 1:
        print(word, count)

Now here is your example, solved in the way shown above:

from collections import Counter
a=[('example',123),('example-one',456),('example',987),('example2',987),('example3',987)]

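# named counts rather than dict, to avoid shadowing the built-in dict type
counts = Counter(word for i, j in a for word in i.lower().split())

print(counts)

# print every name whose count is > 1
for word, count in counts.most_common():
    if count > 1:
        print(word, count)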
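
One observation on this approach, sketched under the assumption that the names contain no whitespace and are already lowercase (true for the sample list): in that case .lower().split() returns each name unchanged, so counting the first tuple element directly gives the same result. The names via_split, direct and repeated are illustrative.

```python
from collections import Counter

a = [('example', 123), ('example-one', 456), ('example', 987),
     ('example2', 987), ('example3', 987)]

# None of the names contain whitespace, so split() yields each name as one word.
via_split = Counter(word for name, _ in a for word in name.lower().split())
direct = Counter(name for name, _ in a)

repeated = [name for name, count in direct.items() if count > 1]
print(via_split == direct, repeated)  # True ['example']
```

The split-based version only makes a difference when the strings hold several whitespace-separated words, as in the skills list of the first example.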

【Comments】: