Frequencies of elements in a 2D numpy array: answers

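【Question title】: Frequencies of elements in 2D numpy array
【Posted】: 2021-07-20 16:01:15
【Question description】:

I have a numpy array output of shape (1000,4). It is an array containing 1000 quadruples with no duplicates, and they are ordered (i.e. an element is [0,1,2,3]). I want to count how many times I got each of the possible quadruples. More practically, I use the following code: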

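import itertools
import numpy as np

# all 4-element combinations of range(32), one per row
comb=np.array(list(itertools.combinations(range(32),4)))
def counting(comb, output):
    n_output=np.zeros(comb.shape[0])
    for i in range(comb.shape[0]):
        # count the rows of output that are equal to comb[i]
        k=0
        for j in range(output.shape[0]):
            if (output[j]==comb[i]).all():
                k+=1
        n_output[i]=k
    return n_output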

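How can I optimize the code? It currently takes 30 seconds to run.

【Comments】:

Try to come up with a method that does not involve nested for loops.
Could you also add the output array?
The output array is filled randomly according to a specific distribution (more complicated than a uniform or Gaussian one). So it looks something like: [[1,2,4,25],...[16,18,20,30]...]

Tags: python arrays performance numpy optimization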

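【Solution 1】:

Your current implementation is inefficient for two reasons:

the algorithmic complexity is O(n^2), since every combination is compared against every row of output;
it relies on (slow CPython) loops.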
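
To put a number on the first point, here is a quick sketch (the variable names are illustrative and not taken from the question) that counts the row comparisons the nested loops perform for the sizes described in the question:

```python
import math

n_comb = math.comb(32, 4)      # number of 4-element combinations of range(32)
n_rows = 1000                  # rows in `output`, as stated in the question
comparisons = n_comb * n_rows  # one (output[j]==comb[i]).all() test per pair

print(n_comb, comparisons)
```

Each of those comparisons is executed by the interpreter, which is why the two reasons compound.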

You can write a simple O(n) algorithm using Python sets (still with loops), since output does not contain any duplicates. The result is as follows:

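def countingFast(comb, output):
    n_output=np.zeros(comb.shape[0])
    # hash every row of output once, so each membership test is O(1) on average
    tmp = set(map(tuple, output))
    for i in range(comb.shape[0]):
        # 1 if comb[i] occurs in output, 0 otherwise
        n_output[i] = int(tuple(comb[i]) in tmp)
    return n_output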

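On my machine, with the described input size, the original version takes 55.2 seconds while this implementation takes 0.038 seconds. That is about 1400 times faster.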
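
If output could contain repeated rows, a set-based membership test would only report 0 or 1 per combination. A minimal sketch of a variant that returns actual counts is shown below; it is not part of the answer, the function name `counting_with_duplicates` is made up for this illustration, and the small input is hypothetical sample data:

```python
import itertools
from collections import Counter

import numpy as np


def counting_with_duplicates(comb, output):
    # Count each distinct row of `output` in a single pass.
    seen = Counter(map(tuple, output))
    # Look up each combination; a Counter returns 0 for missing keys.
    return np.array([seen[tuple(c)] for c in comb])


# Small illustrative input: 4-combinations of range(6) instead of range(32).
comb = np.array(list(itertools.combinations(range(6), 4)))
output = np.array([[0, 1, 2, 3], [1, 2, 4, 5], [0, 1, 2, 3]])
counts = counting_with_duplicates(comb, output)
print(counts[0], counts.sum())
```

The overall cost stays linear in the number of rows plus the number of combinations, as with the set version.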

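【Discussion】:

【Solution 2】:

You can generate a boolean array that indicates whether the sequence you want to check equals a given row of the array. Since numpy boolean arrays can be summed, you can use this result to get the total number of matching rows.

A basic approach could look like this (including sample data generation):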

import numpy as np

# set seed value of random generator to fixed value for repeatable output
np.random.seed(1234)

# create a random array with 950x4 elements
arr = np.random.rand(950, 4)

# create a 50x4 array with sample sequence
# this is the sequence we want to count in our final array
sequence = [0, 1, 2, 3]
sample = np.array([sequence, ]*50)

# stack arrays to create sample data with 1000x4 elements
arr = np.vstack((arr, sample))

# shuffle array to get a random distribution of random sample data and known sequence
np.random.shuffle(arr)

# check for equal array elements, returns a boolean array
results = np.equal(sequence, arr)

# a row matches only if all of its columns match, so reduce each row with np.all
# then sum the resulting boolean vector to get the number of matching rows
occurences = np.sum(np.all(results, axis=1))

print(occurences)
# --> 50

You need to run these lines for every sequence you are interested in, so it is useful to wrap them in a function like this:

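def number_of_occurences(data, sequence):
    # element-wise comparison, broadcast over every row of data
    results = np.equal(sequence, data)
    # count the rows in which every column matches the sequence
    return np.sum(np.all(results, axis=1))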
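
Instead of calling such a function once per sequence, all combinations can also be counted in one vectorized pass. The following is only a sketch and not part of the original answer: it assumes output has an integer dtype with values in range(32), and `count_all_combinations`, its `base` parameter, and the randomly drawn sample data are hypothetical names and inputs chosen for illustration.

```python
import itertools

import numpy as np


def count_all_combinations(comb, output, base=32):
    # Encode each row as one integer by treating its entries as base-`base` digits.
    weights = base ** np.arange(comb.shape[1] - 1, -1, -1)
    comb_keys = comb @ weights
    output_keys = output @ weights
    # Histogram of the keys seen in `output`, then pick the bin of each combination.
    hist = np.bincount(output_keys, minlength=base ** comb.shape[1])
    return hist[comb_keys]


comb = np.array(list(itertools.combinations(range(32), 4)))
rng = np.random.default_rng(0)
# Hypothetical sample data: 1000 rows drawn at random from `comb`.
output = comb[rng.integers(0, comb.shape[0], size=1000)]
n_output = count_all_combinations(comb, output)
print(n_output.sum())
```

The histogram has 32**4 bins (about one million integers), which trades a modest amount of memory for avoiding any Python-level loop over the combinations.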

【Discussion】: