Most common and second most common values in 3D numpy array along last axis: answers

【Question title】: Most common and second most common values in 3D numpy array along last axis
【Posted】: 2019-11-11 07:53:05
【Question】:

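I want to find the 2 most common values along the last axis of my numpy array, together with their frequencies. I already have this working, but I would like it to run faster.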

Example case

The real data is a numpy array of shape (720, 1280, 64) and dtype uint16, but for simplicity let's pretend it is a (2, 2, 4) array.

So the data would look like this:

               0          1  
          ------------------------
        0 | [1,1,1,2] [1,1,2,2]
        1 | [2,2,2,1] [1,1,1,3]

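For each x, y position I want to know what the most common and second most common values are, and how many times each of them appears (if two values are equally common, picking either one is fine).

So for the example above, the most common values are: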

               0          1  
          ------------------------
        0 |    1          1
        1 |    2          1

And how many times they appear:

               0          1  
          ------------------------
        0 |    3          2
        1 |    3          3

The second most common values in the example (if there is no second most common value, putting a zero is fine) are:

               0          1  
          ------------------------
        0 |    2          2
        1 |    1          3

And how often the second most common values appear. If there is no second most common value, anything can go here.

               0          1  
          ------------------------
        0 |    1          2
        1 |    1          1
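
The four tables above can be reproduced with a slow per-cell reference, which is handy for checking any faster implementation on small inputs. This is only a sketch: the function name `top2_reference` and its return order are made up here for illustration, and it relies on `collections.Counter.most_common` to break ties by first occurrence.

```python
from collections import Counter

import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
], dtype=np.uint16)


def top2_reference(arr):
    """Slow per-cell reference: returns (value1, count1, value2, count2)."""
    h, w, _ = arr.shape
    value1 = np.zeros((h, w), dtype=arr.dtype)
    count1 = np.zeros((h, w), dtype=arr.dtype)
    value2 = np.zeros((h, w), dtype=arr.dtype)
    count2 = np.zeros((h, w), dtype=arr.dtype)
    for y in range(h):
        for x in range(w):
            # Up to two (value, count) pairs, most frequent first
            ranked = Counter(arr[y, x].tolist()).most_common(2)
            value1[y, x], count1[y, x] = ranked[0]
            if len(ranked) > 1:
                # Cells with a single distinct value keep the zero defaults
                value2[y, x], count2[y, x] = ranked[1]
    return value1, count1, value2, count2


value1, count1, value2, count2 = top2_reference(a)
print(value1, count1, value2, count2, sep="\n\n")
```

The Python-level double loop makes this far too slow for a (720, 1280, 64) array, so it is meant as a correctness check only.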

Current solution

If the array is called "a", I first do this to get the most common values and their counts:

import numpy as np
from scipy.stats import mode

a = np.array([
    [[1,1,1,2], [1,1,2,2]],
    [[2,2,2,1], [1,1,1,3]]
])

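# keepdims (available in SciPy >= 1.9) keeps the length-1 last axis that the
# mask comparison further down relies on; older SciPy kept it by default.
most_common_value, most_common_count = mode(a, axis=2, keepdims=True)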
print(most_common_value.squeeze())
print(most_common_count.squeeze())

Output:

[[1 1]
 [2 1]]

[[3 2]
 [3 3]]

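Then, to get the second most common values, I simply remove the most common ones and run the above again. To do the removal, I first create a mask of the values to remove.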

mask = a == most_common_value
print(mask)

Output:

array([[[ True,  True,  True, False],
        [ True,  True, False, False]],

       [[ True,  True,  True, False],
        [ True,  True,  True, False]]])

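What I really want now is to delete everything that is True, but since the size along the axis has to stay the same, instead of actually deleting anything I replace the most common values with NaN.

Since these are uint16 values, which cannot represent NaN, I have to convert to float first.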

a = a.astype('float')
np.putmask(a, mask, np.nan)
print(a)

Output:

[[[nan nan nan  2.]
  [nan nan  2.  2.]]

 [[nan nan nan  1.]
  [nan nan nan  3.]]]

Now mode can be run on this again, but it needs to be told to ignore NaN, and the result has to be converted back to uint16.

m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m
print(second_most_common_value.squeeze())
print(second_most_common_count.squeeze())

Output:

[[2 2]
 [1 3]]

[[1 2]
 [1 1]]

At this point I have all the most common and second most common values and how many times they appear along the axis, so I'm done.
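
The NaN masking step above can also be written so that the original uint16 array is left untouched. A minimal sketch, assuming `most_common_value` keeps a length-1 last axis (shape (2, 2, 1)) so that it broadcasts against `a`; it is hard-coded here from the tables above, and the name `masked` is introduced only for this illustration.

```python
import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
], dtype=np.uint16)

# Most common value per cell, hard-coded from the tables above, shape (2, 2, 1)
most_common_value = np.array([[[1], [1]], [[2], [1]]], dtype=np.uint16)

# True wherever an element equals the most common value of its cell
mask = a == most_common_value

# Build a float copy with NaN at the masked positions; `a` itself stays uint16
masked = np.where(mask, np.nan, a.astype(float))

# Number of elements left per cell once the most common value is masked out
remaining = np.count_nonzero(~np.isnan(masked), axis=-1)
print(masked)
print(remaining)
```

Note that `remaining` counts every element that survives the masking, which equals the second most common count only when at most two distinct values occur in a cell.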

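Performance

As mentioned, this solution works, but it is slow. Here is the above repeated as a script with realistic data that you can try running. I also put it up on pastebin, in case that is easier to copy.

Standalone example: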

import time
import numpy as np
from scipy.stats import mode

a = np.random.randint(30000, size=(720, 1280, 64))

start_time = time.time()

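# keepdims (available in SciPy >= 1.9) keeps the length-1 last axis needed for the mask below
most_common_value, most_common_count = mode(a, axis=2, keepdims=True)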

mask = a == most_common_value
a = a.astype('float')
np.putmask(a, mask, np.nan)

m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m

end_time = time.time()
print(f'Took {end_time-start_time:.2f} seconds to run')

Output:

Took 123.29 seconds to run

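Ideally this would run in under 30 seconds, but any improvement is welcome.

Why are you doing this?

As you may have noticed, the first two dimensions of (720, 1280, 64) correspond to a 1280x720 image resolution. The 64 values for each pixel are the colors of the subpixels under that pixel, stored as indices into a known palette.

To know how to color each pixel, I need to know the two most common palette colors so that I can blend them. The data comes from Blender, from a scene I created, so I know that each pixel will almost always contain only two different palette colors.

The point of this project is to improve the rendering quality of my website, where users can create custom animations on the fly; solving this will get rid of the jagged edges in the renders.

Since my animation has 600 frames, running this for every frame takes about a day. I would like to be able to start it when I go to bed and have the finished results in the morning, which is why I want to improve the performance a bit.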

【Comments】:

Tags: python python-3.x numpy scipy

【Answer 1】:

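I ended up writing a simple mode function that iterates over all the values along the last axis, trying each one to see whether it could be the new mode. For my data, this simple solution still ended up being twice as fast as scipy.stats.mode.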

def silly_mode(a):
    """Returns mode and counts for final axis of a numpy array.

    Same as scipy.stats.mode(a, axis=-1).squeeze()
    """

    # Best mode candidate discovered so far
    most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)

    # Silly solution based on just iterating all of final dimension,
    # but still beats scipy if final dimension is less than 100 in length.
    for i in range(0, a.shape[2]):

        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis = -1)

        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis = -1).astype(a.dtype)

        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]

        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)

    return most_common_value, most_common_count

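This has the added benefit that it can be extended to find the second most common values as well, which I think will be faster than my scipy approach of taking the mode, removing the mode values, and then finding the mode again.

Once I get that working, I will update this answer with a way to find the second most common values.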
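
The counting step inside the loop of silly_mode can be looked at in isolation. The sketch below applies a single iteration (i = 0) to the (2, 2, 4) example from the question: the candidate is the first element of each cell, and broadcasting it against the whole array counts how often it occurs. Only calls already used in the answer appear here.

```python
import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
], dtype=np.uint16)

# Candidate value for each cell: element 0 along the last axis, shape (2, 2, 1)
val = np.expand_dims(a[:, :, 0], axis=-1)

# Broadcasting compares every element of a cell with that cell's candidate;
# counting the True entries gives the candidate's frequency per cell
counts = np.count_nonzero(a == val, axis=-1)
print(counts)
# [[3 2]
#  [3 3]]
```

Each pass over the loop costs one full comparison of the array, so the total work grows with the square of the last axis length, which fits the answer's remark that the approach pays off only when that axis is short.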

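Update:

Here is the function for finding the top 2 most common values and their counts. I would not rely on it for anything critical, since it has not been properly tested beyond a few test cases.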

def top_2_most_common_values(a, ignore_zeros = False):
    """Returns mode and counts for each mode of final axis of a numpy array,
    and also returns the second most common value and its counts.

    Similar to calling scipy.stats.mode(a, axis=-1).squeeze() to find the mode,
    except this also returns the second most common values.

    If ignore_zeros is true, then zero will not be considered as a mode candidate.
    In this case a zero instead signifies that there was no most common or second
    most common value, and so the count will also be zero.
    """

    # Silly solution based on just iterating all of final dimension
    most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    second_most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    second_most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)

    for i in range(0, a.shape[2]):

        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis = -1)

        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis = -1).astype(a.dtype)

        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]
        if ignore_zeros:
            update_mask &= val.squeeze() != 0

        # If the most common value changes, then what used to be most common is now second most common.
        # Without the next two lines a case like [1,2,2] would fail, as the second most common value
        # is never encountered again after being initially set to be the most common one.
        np.putmask(second_most_common_value, update_mask, most_common_value)
        np.putmask(second_most_common_count, update_mask, most_common_count)

        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)

        # In a case like [2, 0, 0, 1] the last 1 isn't the new most common value, but it 
        # still should be updated as the second most common value. For these cases separately check 
        # if any encountered value might be the second most common one.
        update_mask = (counts >= second_most_common_count[:,:]) & (val.squeeze() != most_common_value[:,:])
        if ignore_zeros:
            update_mask &= val.squeeze() != 0

        # Update second most common value and its count where necessary
        np.putmask(second_most_common_value, update_mask, val)
        np.putmask(second_most_common_count, update_mask, counts)

    return most_common_value, most_common_count, second_most_common_value, second_most_common_count
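
For cross-checking the function above, a fully vectorized variant is sketched below. It uses a different technique from the answer: sort each cell along the last axis, then measure the lengths of runs of equal values. The name `top2_sorted` and its helper `pick` are made up for this sketch, it has not been benchmarked against the loop version, and it allocates several temporary arrays the size of the input.

```python
import numpy as np


def top2_sorted(a):
    """Top-2 values and counts along the last axis via sorting (illustrative sketch)."""
    s = np.sort(a, axis=-1)
    idx = np.arange(s.shape[-1])

    # Mark where each run of equal values starts and ends
    change = s[..., 1:] != s[..., :-1]
    edge = np.ones(s.shape[:-1] + (1,), dtype=bool)
    run_start = np.concatenate([edge, change], axis=-1)
    run_end = np.concatenate([change, edge], axis=-1)

    # Index of the start of the run each position belongs to
    start_idx = np.maximum.accumulate(np.where(run_start, idx, 0), axis=-1)
    # Run length stored at the last position of each run, 0 elsewhere
    run_len = np.where(run_end, idx - start_idx + 1, 0)

    def pick(lengths):
        # Position, value and length of the longest run in each cell
        pos = np.argmax(lengths, axis=-1)[..., np.newaxis]
        value = np.take_along_axis(s, pos, axis=-1)[..., 0]
        count = np.take_along_axis(lengths, pos, axis=-1)[..., 0]
        return pos, value, count

    pos1, value1, count1 = pick(run_len)
    # Blank out the winning run, then pick again for the runner-up
    np.put_along_axis(run_len, pos1, 0, axis=-1)
    _, value2, count2 = pick(run_len)
    # Cells with a single distinct value get 0 as the second value
    value2 = np.where(count2 > 0, value2, 0).astype(a.dtype)
    return value1, count1, value2, count2


a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
], dtype=np.uint16)
value1, count1, value2, count2 = top2_sorted(a)
print(value1, count1, value2, count2, sep="\n\n")
```

On ties this sketch prefers the smaller value, because runs are compared in sorted order, whereas the loop version prefers the value encountered first; the question allows either choice.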

【Comments】: