Count occurrences of arrays in multidimensional arrays in Python: answers

【Question title】: Count occurrences of arrays in multidimensional arrays in Python
【Posted】: 2016-01-19 00:33:16
【Question description】:

I have an array of the following type:

a = array([[1,1,1],
           [1,1,1],
           [1,1,1],
           [2,2,2],
           [2,2,2],
           [2,2,2],
           [3,3,0],
           [3,3,0],
           [3,3,0]])

I want to count the occurrences of each distinct row, for example:

[1,1,1]:3, [2,2,2]:3, and [3,3,0]: 3

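How can this be done in Python? Is it possible without a for loop that counts into a dictionary? It has to be fast, taking less than about 0.1 seconds. I looked at Counter, numpy bincount and so on, but those work on individual elements, not on arrays.

Thanks.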

【Question comments】:

Possible duplicate of Find unique rows in numpy.array

Tags: python arrays numpy multidimensional-array

【Solution 1】:

As of numpy-1.13.0, np.unique can be used with the axis parameter:

>>> np.unique(a, axis=0, return_counts=True)

(array([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 0]]), array([3, 3, 3]))
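Since the question mentions counting into a dictionary, the two arrays returned above can be zipped together afterwards. This is a minimal sketch, assuming NumPy 1.13 or later; the name row_counts is illustrative and not part of the answer.

```python
import numpy as np

a = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1],
              [2, 2, 2],
              [2, 2, 2],
              [2, 2, 2],
              [3, 3, 0],
              [3, 3, 0],
              [3, 3, 0]])

# Vectorized counting of unique rows (the axis argument needs numpy >= 1.13).
rows, counts = np.unique(a, axis=0, return_counts=True)

# Only the unique rows are converted to tuples, not the whole input array.
row_counts = {tuple(row.tolist()): int(n) for row, n in zip(rows, counts)}

print(row_counts)  # {(1, 1, 1): 3, (2, 2, 2): 3, (3, 3, 0): 3}
```

The Python-level loop here runs once per unique row, so its cost depends on the number of distinct rows rather than on the size of the input.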

【Comments】:

【Solution 2】:

The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for this kind of operation:

import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)

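Note that this works for arrays of any dimension or dtype.

【Comments】:

Perfect answer, but how can we get it as a single output, for example [[[1,1,9],10],[[1,1,0],2]]?
zip(*npi.count(..)) would give you that, but it would not be very numpythonic. Alternatively, if you insist, you could build a structured array with a compound dtype and assign the results to it. But if you stick to the way numpy likes to organize things natively, you will probably end up with a more efficient solution.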
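The two options from the reply above might look roughly like this. To keep the sketch free of extra dependencies it uses np.unique(..., axis=0, return_counts=True) in place of npi.count; the field names "row" and "count" and the sample data are assumptions made for illustration.

```python
import numpy as np

a = np.array([[1, 1, 9],
              [1, 1, 9],
              [1, 1, 0]])

# Stand-in for npi.count(a, axis=0): unique rows and their counts.
rows, counts = np.unique(a, axis=0, return_counts=True)

# Option 1: nested Python lists, one [row, count] pair per unique row.
pairs = [[row.tolist(), int(n)] for row, n in zip(rows, counts)]
print(pairs)  # [[[1, 1, 0], 1], [[1, 1, 9], 2]]

# Option 2: a structured array with a compound dtype (hypothetical field names).
dt = np.dtype([("row", a.dtype, (a.shape[1],)), ("count", np.intp)])
combined = np.empty(len(rows), dtype=dt)
combined["row"] = rows
combined["count"] = counts
```

The structured array keeps everything in one NumPy object, while the list of pairs is easier to print or serialize but gives up vectorized access.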

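【Solution 3】:

You can use np.ravel_multi_index to convert these rows into a 1D array, treating the elements of each row as a multi-dimensional index. Then np.unique gives us the start position of each unique row, and its optional return_counts argument gives us the counts. The implementation would look like this -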

import numpy as np

def unique_rows_counts(a):
    # Calculate linear indices using rows from a
    # (entries must be non-negative integers)
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)

    # Get the unique indices and their counts
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)

    # Return the unique groups from a and their respective counts
    return a[unq_idx], counts

Sample run -

In [64]: a
Out[64]: 
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [3, 3, 0],
       [3, 3, 0],
       [3, 3, 0]])

In [65]: unqrows, counts = unique_rows_counts(a)

In [66]: unqrows
Out[66]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])
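To see what the linear indices look like, here is a small worked sketch on the three distinct rows from the question. The names dims, strides and manual are illustrative only; the hand computation assumes the default row-major (C) order of np.ravel_multi_index.

```python
import numpy as np

a = np.array([[1, 1, 1],
              [2, 2, 2],
              [3, 3, 0]])

# Shape of the smallest grid that contains every row as a coordinate.
dims = a.max(0) + 1                      # [4, 4, 3]

# Each row becomes one flat (row-major) index into that grid.
lidx = np.ravel_multi_index(a.T, dims)

# The same numbers by hand: strides of a row-major 4 x 4 x 3 grid.
strides = np.array([dims[1] * dims[2], dims[2], 1])   # [12, 3, 1]
manual = a @ strides

print(lidx)    # [16 32 45]
print(manual)  # [16 32 45]
```

Within the grid the mapping is one-to-one, so equal rows get equal flat indices and distinct rows get distinct ones, which is why np.unique on lidx counts rows. The entries have to be non-negative integers, and the grid size (the product of dims) has to fit in NumPy's index type.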

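Benchmarking

Assuming you are okay with either numpy arrays or collections as output, the solutions provided so far can be benchmarked like this -

Function definitions: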

import numpy as np
from collections import Counter

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T,a.max(0)+1 )
    _, unq_idx, counts = np.unique(lidx, return_index = True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))    

def forloop_Counter(a):      
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c

Timings:

In [53]: a = np.random.randint(0,4,(10000,5))

In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop

In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop

In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop
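Outside IPython, a comparison along the same lines can be sketched with the standard timeit module. The seed, the repeat counts and the helper names as_dict, expected and results_agree are arbitrary choices for illustration, np.random.default_rng assumes NumPy 1.17 or later, and the timings will differ from machine to machine, so no expected numbers are shown.

```python
import timeit
from collections import Counter

import numpy as np

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability
a = rng.integers(0, 4, size=(10000, 5))   # same shape and value range as above

# Sanity check: both approaches should report identical counts.
rows, counts = unique_rows_counts(a)
as_dict = {tuple(r): int(n) for r, n in zip(rows.tolist(), counts)}
expected = {tuple(int(v) for v in k): n for k, n in map_Counter(a).items()}
results_agree = as_dict == expected
print("results agree:", results_agree)

# Best-of-3 timing, 5 calls per measurement.
for fn in (map_Counter, unique_rows_counts):
    best = min(timeit.repeat(lambda: fn(a), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Checking that the results agree before timing guards against comparing functions that do not compute the same thing.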

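【Comments】:

Remarkably, every answer, and the question as well, has been downvoted. Amazing!
@PadraicCunningham I could add runtime tests, but the OP may want a dict as output, so that would not be fair :)
I will have to wait for the downvoters to post their superior solutions ;)
@PadraicCunningham Not tested for correctness, but something like - base = a.min(0); lidx = np.ravel_multi_index(a.T- base.T[:,None],a.max(0)-base+1 ).
@PadraicCunningham Ah, I see, good to know it is still in use!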
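The offset idea from the comments above, which its author flagged as untested, can be sketched as follows for arrays that contain negative values. The function name unique_rows_counts_offset and the sample data are hypothetical; the sketch shifts every column by its minimum so that all indices passed to np.ravel_multi_index are non-negative.

```python
import numpy as np

def unique_rows_counts_offset(a):
    # Shift each column so its smallest value becomes 0.
    base = a.min(0)
    shifted = a - base

    # Linear indices into a grid sized by each column's value range.
    lidx = np.ravel_multi_index(shifted.T, a.max(0) - base + 1)

    # First occurrence and count of each unique linear index.
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)

    # Index the original (unshifted) array to return the actual rows.
    return a[unq_idx], counts

a = np.array([[-1,  0, 2],
              [ 3, -2, 0],
              [-1,  0, 2],
              [ 3, -2, 0],
              [-1,  0, 2]])

rows, counts = unique_rows_counts_offset(a)
print(rows.tolist())    # [[-1, 0, 2], [3, -2, 0]]
print(counts.tolist())  # [3, 2]
```

Shifting also shrinks the grid when all values are large but close together, since the grid size depends on each column's range rather than its maximum.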

【Solution 4】:

collections.Counter can do this conveniently, almost exactly like the example given.

>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
...   c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

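This converts each sub-list to a tuple, which can be used as a dictionary key because tuples are immutable. Lists are mutable and therefore cannot be used as dict keys.

Why do you want to avoid a for loop?

Similar to @padraic-cunningham's much cooler answer: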
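The point about hashability can be checked directly with a small sketch that uses plain Python lists instead of the NumPy array; the variable names are illustrative.

```python
from collections import Counter

rows = [[1, 1, 1], [2, 2, 2], [1, 1, 1]]

# Lists are unhashable, so Counter cannot use them as keys.
try:
    Counter(rows)
    list_error = None
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# Tuples are hashable, so converting each row first works.
c = Counter(tuple(r) for r in rows)
print(c[(1, 1, 1)])  # 2
```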

>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

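【Comments】:

Python loops over numpy arrays should be avoided, because they can be many times slower than a numpy solution. So if you are combining numpy with Python loops, you are most likely "doing it wrong". What is particularly worrying is that both Python answers copy the entire array to a less compact data type.
@Ophion. Do you have a faster numpy solution?
@PadraicCunningham Divakar's solution will be faster for reasonably sized problems. This question has been asked, answered and benchmarked many times in the past. I think the canonical answer is here.
@Ophion Re: "copy the entire array to a less compact data type". That is a good point. I did think about taking that concern further by using hashing, but then decided it would be needlessly complex. One edit/addition I would make to the answer is retrieving the count for a given element, which in this case is essentially a dict lookup. So, to print (for example) one of the counts, instead of print c[(3, 3, 0)], one could write print c.getcounts(3, 3, 0) (or override __getitem__ with the code of getcounts) and get a similar result.
I added benchmarks to my solution for a fairly large input array. The same benchmarking trend holds for medium-sized inputs as well. I think @Ophion has a good point that NumPy solutions are optimized for cases like this that fit its philosophy.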
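The getcounts lookup mentioned in the comments is not an existing Counter method. One way it might be written, purely as a hypothetical sketch, is a small Counter subclass; the class name RowCounter is an assumption.

```python
from collections import Counter

class RowCounter(Counter):
    """Counter with a convenience lookup (hypothetical helper, not a stdlib API)."""

    def getcounts(self, *row):
        # Pack the positional arguments into the tuple key used for counting.
        return self[tuple(row)]

data = [[3, 3, 0], [3, 3, 0], [1, 1, 1]]
c = RowCounter(map(tuple, data))

print(c.getcounts(3, 3, 0))  # 2
print(c.getcounts(9, 9, 9))  # 0
```

Because Counter returns 0 for missing keys, looking up a row that never occurred does not raise and does not add the key.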

【Solution 5】:

If you do not mind mapping to tuples just to get the counts, you can use a Counter dict, which runs in 28.5 µs on my machine using python3, well below your threshold:

In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop

In [6]: c = Counter(map(tuple, a))

In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

【Comments】:

No worries, using map is actually a bit faster again in python2.