Why is numba so fast? Answers

【Question title】: Why is numba so fast?
【Posted】: 2022-01-14 16:52:44
【Question description】:

I want to write a function that takes an index array lefts of shape (N_ROWS,) and creates a matrix out of shape (N_ROWS, N_COLS) such that out[i, j] = 1 if and only if j >= lefts[i]. A simple example that does this in a loop:

class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts): 
            out[k, l:] = 1
        return out

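Now I want it to be as fast as possible, so I have several implementations of this function:

a plain Python loop
a cython implementation
numba with @njit
a pure C++ implementation that I call with ctypes

Here are the results, averaged over 100 runs: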

Looped took 0.0011599776260009093
Cythonised took 0.0006905699110029673
Numba took 8.886413300206186e-05
CPP took 0.00013200821400096175

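So numba is about 1.5 times faster than the next fastest implementation, the C++ one. My question is: why?

I have heard in similar questions that cython is slower because it is not compiled with all the optimisation flags set, but the C++ implementation was compiled with -O3. Is that enough to get every optimisation the compiler can give me?
I do not fully understand how to hand the numpy array to C++. Am I inadvertently making a copy of the data here?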

# numba implementation

@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts): 
        out[k, l:] = 1.
    return out

    
class Numba(Strategy):
    def __init__(self) -> None:
        # avoid compilation time when timing 
        numba_copy(np.array([1]))

    def copy(self, lefts):
        return numba_copy(lefts)


// array copy cpp

extern "C" void copy(const long *lefts,  float *outdatav, int n_rows, int n_cols) 
{   
    for (int i = 0; i < n_rows; i++) {
        for (int j = lefts[i]; j < n_cols; j++){
            outdatav[i*n_cols + j] = 1.;
        }
    }
}

// compiled to a .so using g++ -O3 -shared -o array_copy.so array_copy.cpp

# using cpp implementation

class CPP(Strategy):

    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # n_rows: the C++ signature takes int, not long
            ctypes.c_int,  # n_cols
            ]
        self.fun = fun

    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata

Full code, including the timing:

import time
import ctypes
from itertools import combinations

import numpy as np
from numpy.ctypeslib import ndpointer
from numba import njit


N_ROWS = 1000
N_COLS = 1000


class Strategy:

    def copy(self, lefts):
        raise NotImplementedError

    def __call__(self, lefts):
        s = time.perf_counter()
        n = 1000
        for _ in range(n):
            out = self.copy(lefts)
        print(f"{type(self).__name__} took {(time.perf_counter() - s)/n}")
        return out


class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts): 
            out[k, l:] = 1
        return out


@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts): 
        out[k, l:] = 1.
    return out


class Numba(Strategy):
    def __init__(self) -> None:
        numba_copy(np.array([1]))

    def copy(self, lefts):
        return numba_copy(lefts)


class CPP(Strategy):

    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # n_rows: the C++ signature takes int, not long
            ctypes.c_int,  # n_cols
            ]
        self.fun = fun

    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata


def copy_over(lefts):
    strategies = [Looped(), Numba(), CPP()]

    outs = []
    for strategy in strategies:
        o = strategy(lefts)
        outs.append(o)

    for s_0, s_1 in combinations(outs, 2):
        for a, b in zip(s_0, s_1):
            np.testing.assert_allclose(a, b)
    

if __name__ == "__main__":
    copy_over(np.random.randint(0, N_COLS, size=N_ROWS))

【Comments】:

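Wow, if numba makes your Python faster than hand-written C++, that is really impressive!
To be honest, you are comparing a package that has probably been highly optimised over months or even years by very clever programmers with your attempt at what is basically a doubly nested loop. There is no doubt who wins that race.
Some basic optimisations plus the right compiler flags might improve the C++ performance: godbolt.org/z/Kz3MWvPEd
@AlanBirtles That gave me a 1.28x speed-up.
Numba actually has an advantage over the C++ code: N_ROWS and N_COLS are hard-coded constants for Numba, whereas they are variables in the C++ code. This may, for example, allow it to unroll some loops.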

Tags: python c++ numpy cython numba

【Solution 1】:

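Numba currently uses LLVM-Lite to compile the code efficiently to a binary (after the Python code has been translated to an LLVM intermediate representation). The code is optimised much as C++ code would be by Clang with the flags -O3 and -march=native. The last flag is very important, because it lets LLVM use wider SIMD instructions on relatively recent x86-64 processors: AVX and AVX2 (possibly AVX512 on very recent Intel processors). Without it, both Clang and GCC use only SSE/SSE2 instructions by default, for backward compatibility.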

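Another difference comes from comparing GCC with the LLVM code generation used by Numba. Clang/LLVM tends to unroll loops aggressively, while GCC often does not. This has a significant performance impact on the resulting program. You can actually see this in the generated assembly code:

With Clang (128 items per loop iteration):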

.LBB0_7:
        vmovups ymmword ptr [r9 + 4*r8 - 480], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 448], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 416], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 384], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 352], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 320], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 288], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 256], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 224], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 192], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 160], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 128], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 96], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 64], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 32], ymm0
        vmovups ymmword ptr [r9 + 4*r8], ymm0
        sub     r8, -128
        add     rbp, 4
        jne     .LBB0_7

With GCC (8 items per loop iteration):

.L5:
        mov     rdx, rax
        vmovups YMMWORD PTR [rax], ymm0
        add     rax, 32
        cmp     rdx, rcx
        jne     .L5

So, to be fair, you need to compare the Numba code with the C++ code compiled with Clang and the optimisation flags above.
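One way to set up that comparison is to time the kernel natively, so that the Python and ctypes call overhead is left out. The sketch below is an assumption about how this could be done, not part of the original answer: `copy_rows` mirrors the question's `copy` function, `time_copy` is a hypothetical helper, and the build lines in the comment reuse the question's file name with the flags mentioned above.

```cpp
// Possible build lines for the shared library (the same flags apply to a
// standalone benchmark executable):
//   clang++ -O3 -march=native -shared -fPIC -o array_copy.so array_copy.cpp
//   g++     -O3 -march=native -shared -fPIC -o array_copy.so array_copy.cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Same loop as the question's copy(): set out[i, j] = 1 for j >= lefts[i].
void copy_rows(const long *lefts, float *out, int n_rows, int n_cols)
{
    for (int i = 0; i < n_rows; i++) {
        for (long j = lefts[i]; j < n_cols; j++) {
            out[static_cast<std::size_t>(i) * n_cols + j] = 1.0f;
        }
    }
}

// Hypothetical helper: average wall-clock seconds per call over `repeats`
// calls. The buffer is zeroed before every call, like np.zeros in the
// Python strategies, so the zeroing is part of the measured time.
double time_copy(const long *lefts, float *out, int n_rows, int n_cols, int repeats)
{
    const std::size_t size = static_cast<std::size_t>(n_rows) * n_cols;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; r++) {
        std::fill(out, out + size, 0.0f);
        copy_rows(lefts, out, n_rows, n_cols);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

Building the same file once with each compiler and calling `time_copy` with the same `lefts` would give numbers that can be compared directly with each other.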

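Note that, depending on your needs and the size of your processor's last-level cache, you can write faster platform-specific C++ code using non-temporal stores (NT stores). NT stores tell the processor not to keep the array in its cache. Writing data with NT stores is faster when writing large arrays to RAM, but it can be slower if you read the array right after writing it and the array would have fitted in the cache (because it then has to be reloaded from RAM). In your case (a 4 MiB array), it is unclear whether this would be faster.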

【Discussion】: