【Posted】: 2022-01-14 16:52:44
【Question】:
I want to write a function that takes an index array lefts of shape (N_ROWS,) and creates a matrix out of shape (N_ROWS, N_COLS) such that out[i, j] = 1 if and only if j >= lefts[i]. A simple example of doing this in a loop:
class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts):
            out[k, l:] = 1
        return out
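For reference, the same matrix can also be built without an explicit loop by broadcasting a comparison. This is only a minimal sketch and is not one of the implementations timed in this question; the helper name broadcast_copy and the small example sizes are assumptions made for illustration.

```python
import numpy as np

def broadcast_copy(lefts, n_cols):
    """Return out with out[i, j] == 1 iff j >= lefts[i] (hypothetical helper)."""
    cols = np.arange(n_cols)          # column indices, shape (n_cols,)
    mask = cols >= lefts[:, None]     # broadcasts to shape (n_rows, n_cols)
    return mask.astype(np.float32)

lefts = np.array([0, 2, 3])
print(broadcast_copy(lefts, 4))
```

This allocates a temporary boolean mask before the cast, so it is not necessarily faster than the loop; it mainly serves as an independent statement of the condition j >= lefts[i] that the other versions can be checked against.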
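Now I would like this to be as fast as possible, so I have several implementations of this function:
- a plain Python loop
- a Cython implementation
- Numba with @njit
- a pure C++ implementation that I call with ctypes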
Here are the results, averaged over 100 runs (seconds per call):
Looped took 0.0011599776260009093
Cythonised took 0.0006905699110029673
Numba took 8.886413300206186e-05
CPP took 0.00013200821400096175
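So Numba is about 1.5x faster than the next fastest implementation, which is the C++ one. My question is: why?
- I have read in similar questions that Cython can be slower because it is not built with all optimisation flags set, but the C++ implementation is compiled with -O3. Is that enough to get all the optimisations the compiler can give me?
- I do not fully understand how a NumPy array is handed over to C++. Am I inadvertently making a copy of the data here?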
# numba implementation
@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts):
        out[k, l:] = 1.
    return out

class Numba(Strategy):
    def __init__(self) -> None:
        # avoid compilation time when timing
        numba_copy(np.array([1]))
    def copy(self, lefts):
        return numba_copy(lefts)
// array copy cpp
extern "C" void copy(const long *lefts, float *outdatav, int n_rows, int n_cols)
{
    for (int i = 0; i < n_rows; i++) {
        for (int j = lefts[i]; j < n_cols; j++) {
            outdatav[i*n_cols + j] = 1.;
        }
    }
}
// compiled to a .so using g++ -O3 -shared -o array_copy.so array_copy.cpp
# using cpp implementation
class CPP(Strategy):
    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # n_rows is declared as int in the C++ signature
            ctypes.c_int,  # n_cols is declared as int in the C++ signature
        ]
        self.fun = fun
    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata
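On the question of whether data is copied on the way into C++: as far as I know, an ndpointer argument type validates the array (dtype and flags) rather than converting it, so an array that passes the check is handed over by address. The sketch below probes that check directly. The helper name passes_without_conversion and the small array shapes are assumptions made for illustration.

```python
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

# Same kind of argument type as used in CPP.__init__ above.
FloatMatrix = ndpointer(ctypes.c_float, flags="C_CONTIGUOUS")

def passes_without_conversion(arr):
    """True if ndpointer accepts arr as-is (hypothetical helper name)."""
    try:
        FloatMatrix.from_param(arr)  # raises TypeError on dtype or layout mismatch
    except TypeError:
        return False
    return True

out = np.zeros((4, 6), dtype=np.float32)
print(passes_without_conversion(out))                     # contiguous float32
print(passes_without_conversion(out[:, ::2]))             # strided view
print(passes_without_conversion(out.astype(np.float64)))  # wrong dtype

# The address exposed to ctypes is the array's own buffer address.
print(out.ctypes.data == out.__array_interface__["data"][0])
```

A mismatching array is rejected with a TypeError instead of being silently converted, so any copy would have to come from an explicit call such as astype or np.ascontiguousarray made before the C++ function is invoked.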
Full code, including the timing:
import time
import ctypes
from itertools import combinations

import numpy as np
from numpy.ctypeslib import ndpointer
from numba import njit

N_ROWS = 1000
N_COLS = 1000

class Strategy:
    def copy(self, lefts):
        raise NotImplementedError

    def __call__(self, lefts):
        # time n calls and report the average seconds per call
        s = time.perf_counter()
        n = 1000
        for _ in range(n):
            out = self.copy(lefts)
        print(f"{type(self).__name__} took {(time.perf_counter() - s)/n}")
        return out

class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts):
            out[k, l:] = 1
        return out

@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts):
        out[k, l:] = 1.
    return out

class Numba(Strategy):
    def __init__(self) -> None:
        # avoid compilation time when timing
        numba_copy(np.array([1]))

    def copy(self, lefts):
        return numba_copy(lefts)

class CPP(Strategy):
    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # n_rows is declared as int in the C++ signature
            ctypes.c_int,  # n_cols is declared as int in the C++ signature
        ]
        self.fun = fun

    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata

def copy_over(lefts):
    strategies = [Looped(), Numba(), CPP()]
    outs = []
    for strategy in strategies:
        o = strategy(lefts)
        outs.append(o)
    # every pair of strategies must produce the same matrix
    for s_0, s_1 in combinations(outs, 2):
        for a, b in zip(s_0, s_1):
            np.testing.assert_allclose(a, b)

if __name__ == "__main__":
    copy_over(np.random.randint(0, N_COLS, size=N_ROWS))
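The hand-rolled timing loop in Strategy.__call__ can be cross-checked with the standard timeit module, which makes it easy to add a warm-up call and to take the best of several repeats. This is a minimal sketch under assumptions: the helper name time_per_call and the standalone looped function are illustrative and not part of the code above.

```python
import timeit
import numpy as np

def time_per_call(func, *args, number=100, repeat=5):
    """Best-of-`repeat` average seconds per call (hypothetical helper)."""
    func(*args)  # warm-up call, e.g. to trigger any JIT compilation first
    totals = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(totals) / number

def looped(lefts, n_cols=1000):
    # Same logic as Looped.copy, written as a free function for timing.
    out = np.zeros((len(lefts), n_cols), dtype=np.float32)
    for k, l in enumerate(lefts):
        out[k, l:] = 1.0
    return out

lefts = np.random.randint(0, 1000, size=1000)
print(f"looped took {time_per_call(looped, lefts):.3e} s per call")
```

Taking the minimum over repeats reduces the influence of unrelated system noise, which matters when the implementations being compared differ by well under a millisecond per call.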
【Discussion】:
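- Wow, if Numba makes your Python faster than hand-written C++, that is pretty impressive!
- Honestly, you are comparing a package that has probably been heavily optimised by very smart programmers over months or even years against what is essentially a doubly nested loop of your own. There is no doubt who wins that race.
- Some basic optimisations plus the right compiler flags might improve the C++ performance: godbolt.org/z/Kz3MWvPEd
- @AlanBirtles That improved my timings by a factor of 1.28.
- Numba actually has an advantage over the C++ code: N_ROWS and N_COLS are hard-coded constants for Numba, whereas they are variables in the C++ version. This might, for example, allow it to unroll some loops.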
Tags: python c++ numpy cython numba