Cython Function Pointer Dereference Time (Compared to Calling Function Directly): Answers

【Question title】: Cython Function Pointer Dereference Time (Compared to Calling Function Directly)
【Posted】: 2019-03-03 05:28:54
【Question】:

I have some Cython code that performs extremely repetitive pixel-wise operations on Numpy arrays (representing BGR images), of the following form:

import cython
import numpy as np
cimport numpy as cnp

ctypedef double (*blend_type)(double, double) # function pointer
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef cnp.ndarray[cnp.float_t, ndim=3] blend_it(const double[:, :, :] array_1, const double[:, :, :] array_2, const blend_type blendfunc, const double opacity):
  # the base layer is a (array_1)
  # the blend layer is b (array_2)
  # base layer is below blend layer
  cdef Py_ssize_t y_len = array_1.shape[0]
  cdef Py_ssize_t x_len = array_1.shape[1]
  cdef Py_ssize_t a_channels = array_1.shape[2]
  cdef Py_ssize_t b_channels = array_2.shape[2]
  cdef cnp.ndarray[cnp.float_t, ndim=3] result = np.zeros((y_len, x_len, a_channels), dtype = np.float_)
  cdef double[:, :, :] result_view = result
  cdef Py_ssize_t x, y, c

  for y in range(y_len):
    for x in range(x_len):
      for c in range(3): # iterate over BGR channels first
        # calculate channel values via blend mode
        a = array_1[y, x, c]
        b = array_2[y, x, c]
        result_view[y, x, c] = blendfunc(a, b)
        # many other operations involving result_view...
  return result

where blendfunc refers to another Cython function, such as the following overlay_pix:

cdef double overlay_pix(double a, double b):
  if a < 0.5:
    return 2*a*b
  else:
    return 1 - 2*(1 - a)*(1 - b)

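The point of using a function pointer is to avoid rewriting this large block of repetitive code over and over for each blend mode (of which there are many). So I created an interface like this for each blend mode, which saves me that trouble: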

def overlay(double[:, :, :] array_1, double[:, :, :] array_2, double opacity = 1.0):
  return blend_it(array_1, array_2, overlay_pix, opacity)

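However, this seems to cost me some time. For very large images (8K and larger, for example), using blendfunc inside blend_it is substantially slower than calling overlay_pix directly inside blend_it. I assume this is because blend_it has to dereference the function pointer on every iteration instead of having the function immediately available, but I am not sure.

The time loss is not ideal, but I certainly do not want to rewrite blend_it over and over for each blend mode. Is there any way to avoid the time loss? Is there a way to turn the function pointer into a local function outside the loop and then access it faster inside the loop?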

【Comments】:

Tags: python function numpy pointers cython

【Solution 1】:

@ead's answer says two things:

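1. C may be able to optimize the indirect call into a direct call. Except for fairly simple cases, I do not think this is generally true, and it does not seem to be true for the compiler and code the OP is using.
2. In C++ you would use templates instead. This is definitely true: template types are always known at compile time, so the optimization is usually easy.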

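Cython and C++ templates are a bit of a mess together, so I do not think you want to use them here. However, Cython does have a template-like feature called fused types. You can use fused types to get the compile-time optimization, as demonstrated below. The rough outline of the code is:

1. For each operation you want to perform, define a cdef class containing a staticmethod cdef function.
2. Define a fused type containing all of these cdef classes. (This is the limitation of the approach: it is not easily extensible, so if you want to add an operation you have to edit the code.)
3. Define a function that takes a dummy argument of your fused type. Use this type to call the staticmethod.
4. Define the wrapper functions. You need to use the explicit [type] syntax to get this to work.

Code: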

import cython

cdef class Plus:
    @staticmethod
    cdef double func(double x):
        return x+1

cdef class Minus:
    @staticmethod
    cdef double func(double x):
        return x-1

ctypedef fused pick_func:
    Plus
    Minus

cdef run_func(double [::1] x, pick_func dummy):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = cython.typeof(dummy).func(x[i])
    return x.base

def run_func_plus(x):
    return run_func[Plus](x,Plus())

def run_func_minus(x):
    return run_func[Minus](x,Minus())

For comparison, the equivalent code using function pointers is

cdef double add_one(double x):
    return x+1

cdef double minus_one(double x):
    return x-1

cdef run_func_ptr(double [::1] x, double (*f)(double)):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_func_ptr_plus(x):
    return run_func_ptr(x,add_one)

def run_func_ptr_minus(x):
    return run_func_ptr(x,minus_one)

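Using timeit, I get a speed-up of about 2.5x for the fused-type version compared to the function-pointer version. This suggests that the function-pointer call is not being optimized away for me (although I have not tried changing compiler settings to improve it).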

import numpy as np
import example

# show the two methods give the same answer
print(example.run_func_plus(np.ones((10,))))
print(example.run_func_minus(np.ones((10,))))

print(example.run_func_ptr_plus(np.ones((10,))))
print(example.run_func_ptr_minus(np.ones((10,))))

from timeit import timeit

# timing comparison
print(timeit("""run_func_plus(x)""",
             """from example import run_func_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

print(timeit("""run_func_ptr_plus(x)""",
             """from example import run_func_ptr_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

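【Comments】:

I also thought about fused types, but was not clever enough to wrap the functions into classes. By the way, on my system (linux+gcc5) there is no difference in running time, because gcc is evidently able to inline these simple functions.
I am on Linux with GCC 8.2, so it is a little surprising that it fails to inline with what should be a better version. I would guess that some combination of compiler flags gets it to work. It does not seem like a reliable optimization, though.

【Solution 2】:

It is true that using a function pointer may carry a small additional cost, but most of the time the performance hit comes from the compiler no longer being able to inline the called function and to perform the optimizations that become possible after inlining.

I would like to demonstrate this with the following example, which is somewhat smaller than yours: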

int f(int i){
    return i;
}

int sum_with_fun(){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=f(i);
    }
    return sum;
}

typedef int(*fun_ptr)(int);
int sum_with_ptr(fun_ptr ptr){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=ptr(i);
    }
    return sum;
}

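So there are two versions that calculate the sum of f(i) for i=0...999: one with a function pointer and one direct.

When compiled with -fno-inline (i.e. with inlining disabled, to level the playing field), they produce almost identical assembly (here on godbolt.org). The slight difference is the way the function is called: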

callq  4004d0 <_Z1fi>  //direct call
...
callq  *%r12           //via ptr

Performance-wise, this will not make much of a difference.

But when we drop -fno-inline, the compiler can shine for the direct version, which becomes (on godbolt.org)

_Z12sum_with_funv:
        movl    $499500, %eax
        ret

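i.e. the whole loop is evaluated at compile time, in contrast to the unchanged indirect version, which still has to execute the loop at run time: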

_Z12sum_with_ptrPFiiE:
        pushq   %r12
        movq    %rdi, %r12
        pushq   %rbp
        xorl    %ebp, %ebp
        pushq   %rbx
        xorl    %ebx, %ebx
.L5:
        movl    %ebx, %edi
        addl    $1, %ebx
        call    *%r12
        addl    %eax, %ebp
        cmpl    $1000, %ebx
        jne     .L5
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret

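So where does that leave you? You can wrap the indirect function in a wrapper that passes a known pointer, and chances are high that the compiler will be able to perform the optimization above, for example: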

... 
int sum_with_f(){
    return sum_with_ptr(&f);
}

which results in (here on godbolt.org):

_Z10sum_with_fv:
        movl    $499500, %eax
        ret

With the approach above you are at the mercy of the compiler to perform the inlining (but modern compilers are merciful).
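Applied to a per-pixel blend like the one in the question, the wrapper idea might look like the following C sketch. The names blend_fn, blend_generic and blend_overlay are made up for illustration, and flat arrays stand in for the image; whether the call is actually inlined still depends on the compiler and its optimization flags.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*blend_fn)(double, double);

/* Overlay blend for a single channel value in [0, 1]. */
static inline double overlay_pix(double a, double b) {
    return a < 0.5 ? 2.0 * a * b : 1.0 - 2.0 * (1.0 - a) * (1.0 - b);
}

/* Generic loop taking a function pointer. It is static inline so that the
   compiler is free to create a specialised copy inside each wrapper. */
static inline void blend_generic(const double *a, const double *b,
                                 double *out, size_t n, blend_fn f) {
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        out[i] = f(a[i], b[i]);
    }
}

/* Wrapper: the pointer passed here is a compile-time constant, which gives
   the optimiser the chance to turn the indirect call into an inlined one. */
void blend_overlay(const double *a, const double *b, double *out, size_t n) {
    blend_generic(a, b, out, n, overlay_pix);
}
```

Each blend mode then needs only a one-line wrapper of this kind, so the loop body is still written once.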

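There are also other options, depending on what you actually use:

- In C++ there are templates, which eliminate this kind of repetitive work without a performance penalty.
- In C, macros can be used to the same effect.
- Numpy uses a preprocessor to generate the repetitive code; see for example this src-file, from which the c file is generated in a preprocessing step.
- pandas uses an approach similar to numpy's for its cython code; see for example the hashtable_func_helper.pxi.in-file.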
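As a rough illustration of the macro option, a C macro can stamp out one specialised loop per blend mode, so that each loop contains a direct call the compiler can inline. This is only a sketch: DEFINE_BLEND, multiply_pix and the flat-array signature are assumptions made for the example and are not taken from numpy or pandas.

```c
#include <assert.h>
#include <stddef.h>

/* Per-pixel blend functions (hypothetical examples). */
static inline double overlay_pix(double a, double b) {
    return a < 0.5 ? 2.0 * a * b : 1.0 - 2.0 * (1.0 - a) * (1.0 - b);
}

static inline double multiply_pix(double a, double b) {
    return a * b;
}

/* Generates a function NAME whose loop calls PIX_FN directly,
   so no function pointer is involved at run time. */
#define DEFINE_BLEND(NAME, PIX_FN)                                      \
    void NAME(const double *a, const double *b, double *out, size_t n) { \
        assert(a && b && out);                                          \
        for (size_t i = 0; i < n; i++) {                                \
            out[i] = PIX_FN(a[i], b[i]);                                \
        }                                                               \
    }

/* One line per blend mode; the loop is written only once. */
DEFINE_BLEND(blend_overlay, overlay_pix)
DEFINE_BLEND(blend_multiply, multiply_pix)
```

The trade-off is the usual one for macros: the generated code is harder to debug and step through than a plain function.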

【Comments】: