Speed up structured NumPy array

【Question title】: Speed up structured NumPy array
【Posted】: 2016-04-28 06:00:43
【Question】:

NumPy arrays are great for both performance and ease of use (they are easier to slice and index than lists).

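I tried to build a data container out of a NumPy structured array instead of a dict of NumPy arrays. The problem is that performance is much worse: the structured array is about 2.5 times slower with homogeneous data and about 32 times slower with heterogeneous data (I am talking about NumPy data types).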

Is there a way to speed up structured arrays? I tried changing the memory order from 'c' to 'f', but this had no effect.
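A quick check of why the memory order probably makes no difference here: the arrays in the question are one-dimensional, and a 1-D array is both C- and F-contiguous at the same time. This is only a minimal sketch; the names `arr_c` and `arr_f` are made up for illustration.

```python
import numpy as np

dt = [('a', np.double), ('b', np.double)]

# Same 1-D structured array, requested once in C order and once in F order.
arr_c = np.zeros(5, dtype=dt, order='C')
arr_f = np.zeros(5, dtype=dt, order='F')

# For a 1-D array both flags are set, whichever order was requested.
print(arr_c.flags.c_contiguous, arr_c.flags.f_contiguous)
print(arr_f.flags.c_contiguous, arr_f.flags.f_contiguous)

# The strides are identical too: one 16-byte record per element.
print(arr_c.strides, arr_f.strides)
```

The order argument only changes the layout for arrays with two or more dimensions, so it cannot reorder the fields inside a record.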

Here is my profiling code:

import time
import numpy as np

NP_SIZE = 100000
N_REP = 100

np_homo = np.zeros(NP_SIZE, dtype=[('a', np.double), ('b', np.double)], order='c')
np_hetro = np.zeros(NP_SIZE, dtype=[('a', np.double), ('b', np.int32)], order='c')
dict_homo = {'a': np.zeros(NP_SIZE), 'b': np.zeros(NP_SIZE)}
dict_hetro = {'a': np.zeros(NP_SIZE), 'b': np.zeros(NP_SIZE, np.int32)}

t0 = time.time()
for i in range(N_REP):
    np_homo['a'] += i

t1 = time.time()
for i in range(N_REP):
    np_hetro['a'] += i

t2 = time.time()
for i in range(N_REP):
    dict_homo['a'] += i

t3 = time.time()
for i in range(N_REP):
    dict_hetro['a'] += i
t4 = time.time()

print('Homogeneous Numpy struct array took {:.4f}s'.format(t1 - t0))
print('Heterogeneous Numpy struct array took {:.4f}s'.format(t2 - t1))
print('Homogeneous Dict of numpy arrays took {:.4f}s'.format(t3 - t2))
print('Heterogeneous Dict of numpy arrays took {:.4f}s'.format(t4 - t3))

Edit: I forgot to include my timings:

Homogeneous Numpy struct array took 0.0101s
Heterogeneous Numpy struct array took 0.1367s
Homogeneous Dict of numpy arrays took 0.0042s
Heterogeneous Dict of numpy arrays took 0.0042s

Edit 2: I added some extra test cases using the timeit module:

import numpy as np
import timeit

NP_SIZE = 1000000

def time(data, txt, n_rep=1000):
    def intern():
        data['a'] += 1

    time = timeit.timeit(intern, number=n_rep)
    print('{} {:.4f}'.format(txt, time))


np_homo = np.zeros(NP_SIZE, dtype=[('a', np.double), ('b', np.double)], order='c')
np_hetro = np.zeros(NP_SIZE, dtype=[('a', np.double), ('b', np.int32)], order='c')
dict_homo = {'a': np.zeros(NP_SIZE), 'b': np.zeros(NP_SIZE)}
dict_hetro = {'a': np.zeros(NP_SIZE), 'b': np.zeros(NP_SIZE, np.int32)}

time(np_homo, 'Homogeneous Numpy struct array')
time(np_hetro, 'Heterogeneous Numpy struct array')
time(dict_homo, 'Homogeneous Dict of numpy arrays')
time(dict_hetro, 'Heterogeneous Dict of numpy arrays')

Results:

Homogeneous Numpy struct array 0.7989
Heterogeneous Numpy struct array 13.5253
Homogeneous Dict of numpy arrays 0.3750
Heterogeneous Dict of numpy arrays 0.3744

The ratios seem fairly stable between runs, with both methods and with different array sizes.

In case it matters: Python 3.4, NumPy 1.9.2

【Comments】:

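Because this question asks about a specific NumPy performance problem rather than for a general critique, it was migrated from Code Review to Stack Overflow.
If you really want to use structured arrays, I suggest you try pandas.
See this issue: github.com/numpy/numpy/issues/6467
I see the same timings here. As for np_homo vs. np_hetro, it may be related to alignment, because with np.int64 as the second dtype it is not that slow.
@MaxNoe: I saw that issue before opening this question. But I believe it is not the same thing, because I am using 1.9.2 and that issue only appeared in 1.10.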
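The alignment point raised in the comments can at least be inspected directly. The sketch below only shows the record sizes and strides involved; it does not prove that alignment causes the slowdown. The names `packed`, `padded` and `wide` are made up for illustration, and the `align=True` variant is an assumption of mine that nobody in this thread timed.

```python
import numpy as np

# double + int32 without padding: records are 12 bytes long.
packed = np.dtype([('a', np.double), ('b', np.int32)])

# Same fields, but asking NumPy to pad the record like a C struct.
padded = np.dtype([('a', np.double), ('b', np.int32)], align=True)

# double + int64, the variant mentioned in the comment above.
wide = np.dtype([('a', np.double), ('b', np.int64)])

for name, dt in [('packed', packed), ('padded', padded), ('wide', wide)]:
    arr = np.zeros(4, dtype=dt)
    # itemsize is the record length; the stride of the 'a' view equals it.
    print(name, dt.itemsize, arr['a'].strides, dt.isalignedstruct)
```

With 12-byte records, every second `'a'` value starts at an offset that is not a multiple of 8, which is consistent with the alignment guess. Whether `align=True` actually recovers the speed would have to be measured.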

Tags: performance python-3.x numpy

【Answer 1】:

In my quick timing tests the difference is not that large:

In [717]: dict_homo = {'a': np.zeros(10000), 'b': np.zeros(10000)}
In [718]: timeit dict_homo['a']+=1
10000 loops, best of 3: 25.9 µs per loop
In [719]: np_homo = np.zeros(10000, dtype=[('a', np.double), ('b', np.double)])
In [720]: timeit np_homo['a'] += 1
10000 loops, best of 3: 29.3 µs per loop

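In the dict_homo case, the fact that the array is stored in a dictionary is a minor point. A simple dictionary access like this is fast, basically the same as accessing the array by variable name.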

So the first case is essentially a test of += on a 1D array.

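In the structured case, the a and b values alternate in the data buffer, so np_homo['a'] is a view that 'pulls out' every other number. So it is not surprising that it is a bit slower.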
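The interleaving described above shows up in the strides of the field view. A minimal sketch, reusing the array names from this answer (`field_a` is a made-up name):

```python
import numpy as np

np_homo = np.zeros(10000, dtype=[('a', np.double), ('b', np.double)])
dict_homo = {'a': np.zeros(10000), 'b': np.zeros(10000)}

field_a = np_homo['a']

# The field is a view into the record buffer, not a copy.
print(np.shares_memory(field_a, np_homo))

# Consecutive 'a' values are 16 bytes apart (one whole record),
# while the plain array steps 8 bytes at a time.
print(field_a.strides, dict_homo['a'].strides)

# So the field view is not contiguous, unlike the plain 1D array.
print(field_a.flags.c_contiguous, dict_homo['a'].flags.c_contiguous)
```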

In [721]: np_homo
Out[721]: 
array([(41111.0, 0.0), (41111.0, 0.0), (41111.0, 0.0), ..., (41111.0, 0.0),
       (41111.0, 0.0), (41111.0, 0.0)], 
      dtype=[('a', '<f8'), ('b', '<f8')])

A 2D array also interleaves its column values.

In [722]: np_twod=np.zeros((10000,2), np.double)
In [723]: timeit np_twod[:,0]+=1
10000 loops, best of 3: 36.8 µs per loop

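Surprisingly, this is actually a bit slower than the structured case. Using order='F' or a (2,10000) shape speeds it up, but it is still not as good as the structured case.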
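For reference, here is how the three 2D layouts mentioned above differ in memory. It is only a sketch of the layouts, not a timing; the names `c_cols`, `f_cols` and `rows` are made up.

```python
import numpy as np

n = 10000
c_cols = np.zeros((n, 2), np.double)             # default C order
f_cols = np.zeros((n, 2), np.double, order='F')  # column-major
rows = np.zeros((2, n), np.double)               # transposed shape

# In C order a column is strided: its values are 16 bytes apart,
# the same layout as a field of the two-double structured array.
print(c_cols[:, 0].strides, c_cols[:, 0].flags.c_contiguous)

# With order='F', or with the (2, n) shape, the same data is contiguous.
print(f_cols[:, 0].strides, f_cols[:, 0].flags.c_contiguous)
print(rows[0].strides, rows[0].flags.c_contiguous)
```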

These are all small test times, so I won't make grand claims. But the structured array does not look bad.

Another test, initializing the array or dictionary fresh at each step:

In [730]: %%timeit np_twod=np.zeros((10000,2), np.double)
np_twod[:,0] += 1
   .....: 
10000 loops, best of 3: 36.7 µs per loop
In [731]: %%timeit np_homo = np.zeros(10000, dtype=[('a', np.double), ('b', np.double)])
np_homo['a'] += 1
   .....: 
10000 loops, best of 3: 38.3 µs per loop
In [732]: %%timeit dict_homo = {'a': np.zeros(10000), 'b': np.zeros(10000)}
dict_homo['a'] += 1
   .....: 
10000 loops, best of 3: 25.4 µs per loop

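The 2D and structured cases are closer together here, and the dictionary (1D) case performs better. I also tried this with np.ones, since np.zeros can allocate lazily, but it made no difference in behavior.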

【Comments】:

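Hmm, that is interesting, especially the first result. Have you tried increasing the number of elements? Just to make sure the time taken is not dominated by some constant overhead.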
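One way to follow up on that suggestion is to repeat the same `+= 1` comparison at several sizes and look at the ratio. This is only a sketch: the helper `bench`, the chosen sizes and the repetition count are all assumptions, and the numbers will depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def bench(size, n_rep=200):
    """Time `+= 1` on a structured field and on a plain array of `size` elements."""
    struct = np.zeros(size, dtype=[('a', np.double), ('b', np.double)])
    plain = {'a': np.zeros(size), 'b': np.zeros(size)}

    def inc_struct():
        struct['a'] += 1

    def inc_plain():
        plain['a'] += 1

    t_struct = timeit.timeit(inc_struct, number=n_rep)
    t_plain = timeit.timeit(inc_plain, number=n_rep)
    return t_struct, t_plain

results = {}
for size in (1000, 10000, 100000):
    results[size] = bench(size)
    t_struct, t_plain = results[size]
    print('{:>7d}  struct {:.4f}s  dict {:.4f}s  ratio {:.2f}'.format(
        size, t_struct, t_plain, t_struct / t_plain))
```

If a constant per-call overhead dominated, the ratio should move toward 1 as the size shrinks and settle at some other value as it grows.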