Frequency table of lists within a multidimensional numpy array: answers

【Question title】: Frequency table of lists within multidimensional numpy array
【Posted】: 2018-06-15 10:28:57
【Question description】:

I have some (a lot of) binary-encoded vectors, for example:

[0, 1, 0, 0, 1, 0]  # but each one has many more elements

They are all stored in a single 2D numpy array, for example:

[
 [0, 1, 0, 0, 1, 0],
 [0, 0, 1, 0, 0, 1],
 [0, 1, 0, 0, 1, 0],
]

I want to get a frequency table with one entry per distinct label set. In this example, the frequency table would be:

[2,1]

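because the first label set appears twice and the second label set appears only once.

In other words, I want something like itemfreq from SciPy or histogram from numpy, but applied to whole lists rather than to single elements.

Right now I have implemented the following code: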

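import numpy as np

def get_label_set_freq_table(labels):
    # Caution: np.empty_like leaves this buffer uninitialized, so a row
    # could match leftover memory by chance before any row is stored.
    uniques = np.empty_like(labels)
    # One slot per input row; slots of repeated rows stay at 0.
    freq_table = np.zeros(shape=labels.shape[0])
    equal = False

    for idx, row in enumerate(labels):
        # Linear scan over every slot: O(n^2) row comparisons overall.
        for lbl_idx, label_set in enumerate(uniques):
            if np.array_equal(row, label_set):
                equal = True
                freq_table[lbl_idx] += 1
                break
        if not equal:
            # First time this row is seen: store it and start its count.
            uniques[idx] = row
            freq_table[idx] += 1
        equal = False

    return freq_table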

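Here, labels is the array of binary-encoded vectors.

It works fine, but it becomes very slow when the number of vectors is large (>58,000) and the number of elements in each vector is also large (>8,000).

How can I do this more efficiently?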

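【Question discussion】:

That doesn't look one-hot encoded to me.
You're right, I'll edit the question to say "binary" vectors. Thanks. @Divakar, the same thanks go to you.

Tags: python performance numpy scipy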

【Solution 1】:

I assume you mean an array containing only 1s and 0s. For those, we can use binary scaling to reduce each row to a scalar, and then use np.unique:

In [52]: a
Out[52]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [53]: s = 2**np.arange(a.shape[1])

In [54]: a1D = a.dot(s)

In [55]: _, start, count = np.unique(a1D, return_index=True, return_counts=True)

In [56]: a[start]
Out[56]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1]])

In [57]: count
Out[57]: array([2, 1])
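
The session above can be wrapped into a small function. This is only a sketch: the name count_rows_binary_scaling and the width check are assumptions added here, not part of the original answer. The check is there because each row is collapsed into a single int64, so the scalar key is only collision-free while the number of columns fits in that integer (at most 63 here); much wider rows, such as the 8,000-element vectors in the question, would overflow.

```python
import numpy as np

def count_rows_binary_scaling(a):
    """Count identical rows of a 0/1 array by reducing each row to one integer.

    Hypothetical helper for illustration; it assumes `a` is 2D and holds
    only 0s and 1s.
    """
    a = np.asarray(a, dtype=np.int64)
    n_cols = a.shape[1]
    # Assumption made explicit: the keys are int64, so more than 63 columns
    # would overflow and distinct rows could end up sharing a key.
    if n_cols > 63:
        raise ValueError("binary scaling into int64 supports at most 63 columns")

    # Column j gets weight 2**j, so every distinct 0/1 row maps to a distinct key.
    weights = 2 ** np.arange(n_cols, dtype=np.int64)
    keys = a.dot(weights)

    # `start` holds the index of the first row that produced each unique key.
    _, start, count = np.unique(keys, return_index=True, return_counts=True)
    return a[start], count

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
unique_rows, counts = count_rows_binary_scaling(a)
print(unique_rows)
print(counts)
```

Note that the results come back sorted by the integer key, not by order of appearance; in this example the two orders happen to coincide.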

Here is a generic one, which works for rows with arbitrary values:

In [33]: unq_rows, freq = np.unique(a, axis=0, return_counts=True)

In [34]: unq_rows
Out[34]: 
array([[0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [35]: freq
Out[35]: array([1, 2])
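
np.unique with axis=0 returns the unique rows in sorted order, which is why the counts above come out as [1, 2] rather than the [2, 1] shown in the question. If the order of first appearance matters, one possible sketch is to also request return_index and reorder by it. The function name count_rows_in_order is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def count_rows_in_order(a):
    """Return unique rows and their counts, ordered by first appearance.

    Hypothetical helper for illustration; it works for any 2D array,
    not only binary ones.
    """
    a = np.asarray(a)
    # first_idx[i] is the index of the first row equal to unq_rows[i].
    unq_rows, first_idx, counts = np.unique(
        a, axis=0, return_index=True, return_counts=True
    )
    # Sorting by first occurrence restores the order in which rows appear in `a`.
    order = np.argsort(first_idx)
    return unq_rows[order], counts[order]

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
rows_in_order, freq_in_order = count_rows_in_order(a)
print(rows_in_order)
print(freq_in_order)
```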

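【Discussion】:

I forgot about the axis argument... wow! Your solution is great! So efficient, so elegant! Thank you very much! I checked it and it works like a charm, so I'm accepting the answer!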