获取numpy数组模式的最快方法是什么答案

【问题标题】：what is the fastest way to get the mode of a numpy array获取numpy数组模式的最快方法是什么
【发布时间】：2019-03-06 12:25:13
【问题描述】：

我必须找到从 hdf5 文件中读取的 NumPy 数组的模式。 NumPy 数组是 1d 并且包含浮点值。

my_array=f1[ds_name].value    
mod_value=scipy.stats.mode(my_array)

我的数组是 1d 并且包含大约 1M 的值。我的脚本需要大约 15 分钟才能返回模式值。有什么办法可以加快速度吗？

另一个问题是为什么scipy.stats.median(my_array) 在模式有效时不起作用？

AttributeError: 模块 'scipy.stats' 没有属性 'median'

【问题讨论】：

听起来是 IO 绑定的，因为我认为剩余的代码是最佳的。所以检查你的hdf；缓冲区，压缩和合作。另外： scipy.stats 没有称为中值的函数，可以通过阅读文档轻松检查。您可以只使用 numpy 的中位数。
@sascha 文件在 0.02 秒内被读取。在这行代码“scipy.stats.median(my_array)”中计算模式花费了 15 分钟。
也许您应该显示更多代码，因为您的时间与给定答案中显示的这些合成示例有很大不同（这也表明我错了；实现的加速并不那么难)。
@sascha 请检查添加的回复以获取完整代码。如果您需要我的输入文件进行测试，请告诉我，谢谢。

标签： numpy scipy

【解决方案1】：

scipy.stats.mode 的实现有一个 Python 循环，用于处理带有多维数组的 axis 参数。下面的简单实现，仅适用于一维数组，速度更快：

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

这是一个例子。首先，创建一个长度为 1000000 的整数数组。

In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)

In [41]: x.shape
Out[41]: (1000000,)

检查scipy.stats.mode 和mode1 是否给出相同的结果。

In [42]: from scipy.stats import mode

In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))

In [44]: mode1(x)
Out[44]: (1009, 1066)

现在检查性能。

In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

mode(x) 为 2.91 秒，mode1(x) 仅为 39.6 毫秒。

【讨论】：

只是为了好玩，我尝试了Counter。运行时间大约是 scs.mode 的一半，但您的解决方案很容易做到。

【解决方案2】：

这是一种基于排序的方法 -

def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()

    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount

请注意，这会在对输入数组进行排序时对其进行变异/更改。因此，如果您想保持输入数组未变异或介意输入数组被排序，请传递一个副本。

在 1M 元素上运行示例 -

In [65]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)

In [66]: from scipy.stats import mode

In [67]: mode(x)
Out[67]: ModeResult(mode=array([ 295.]), count=array([1098]))

In [68]: mode1d(x)
Out[68]: (295.0, 1098)

运行时测试

In [75]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)

# Scipy's mode
In [76]: %timeit mode(x)
1 loop, best of 3: 1.64 s per loop

# @Warren Weckesser's soln
In [77]: %timeit mode1(x)
10 loops, best of 3: 52.7 ms per loop

# Proposed in this post
In [78]: %timeit mode1d(x)
100 loops, best of 3: 12.8 ms per loop

有了副本，mode1d 的时间将与 mode1 相当。

【讨论】：

【解决方案3】：

我将上面回复中的两个函数 mode1 和 mode1d 添加到我的脚本中，并尝试与 scipy.stats.mode 进行比较。

dir_name="C:/Users/test_mode"
file_name="myfile2.h5"
ds_name="myds"
f_in=os.path.join(dir_name,file_name)

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()

    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount


startTime=time.time()
with h5py.File(f_in, "a") as f1:

        myds=f1[ds_name].value
        time1=time.time()
        file_read_time=time1-startTime
        print(str(file_read_time)+"\t"+"s"+"\t"+str((file_read_time)/60)+"\t"+"min")

        print("mode_scipy=")
        mode_scipy=scipy.stats.mode(myds)
        print(mode_scipy)
        time2=time.time()
        mode_scipy_time=time2-time1
        print(str(mode_scipy_time)+"\t"+"s"+"\t"+str((mode_scipy_time)/60)+"\t"+"min")

        print("mode1=")
        mode1=mode1(myds)
        print(mode1)
        time3=time.time()
        mode1_time=time3-time2
        print(str(mode1_time)+"\t"+"s"+"\t"+str((mode1_time)/60)+"\t"+"min")

        print("mode1d=")
        mode1d=mode1d(myds)
        print(mode1d)
        time4=time.time()
        mode1d_time=time4-time3
        print(str(mode1d_time)+"\t"+"s"+"\t"+str((mode1d_time)/60)+"\t"+"min")

为大约 1M 的 numpy 数组运行脚本的结果是：

mode_scipy= ModeResult(mode=array([1.11903353e-06], dtype=float32), count=array([304909])) 938.8368742465973 秒
15.647281237443288 分钟

mode1=(1.1190335e-06, 304909)

0.06500649452209473 秒
0.0010834415753682455 分钟

mode1d=(1.1190335e-06, 304909)

0.06200599670410156 秒
0.0010334332784016928 分钟

【讨论】：