Using ThreadPoolExecutor in conjunction with Cython's nogil: answers

【Question title】: Usage of threadpoolexecutor in conjunction with cython's nogil
【Posted】: 2019-06-11 06:46:13
【Question description】:

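I have read this question and its answers, "Cython nogil with ThreadPoolExecutor not giving speedups", and my Cython code has a similar problem: I do not get the expected speedup even though my system has multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, yet the code below runs faster when I set the number of jobs to 1 than when I set it to 4. Watching CPU usage with top, I see it go up to 300%. I am looking up data in a C++ class that is never modified, i.e. I only make read-only queries against C++ data structures through Cython. There are no mutexes on the C++ side.

This is my first time working with the GIL, and I wonder whether I am using it incorrectly. The timing output is also a bit confusing, because I don't think it correctly reflects the actual time spent by each worker thread.

I seem to be missing something important but cannot figure out what, since I used nearly the same template for releasing the GIL as the linked SO answer.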

import time

import psutil
import numpy as np

from concurrent.futures import ThreadPoolExecutor
from functools import partial



cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil


cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):

        latGrid = np.arange(minLat,maxLat,0.05)
        lonGrid = np.arange(minLon,maxLon,0.05)

        gridLon,gridLat = np.meshgrid(latGrid,lonGrid)
        grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]

        n_jobs = psutil.cpu_count(logical=False)

        chunk = np.array_split(grid_points,n_jobs,axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)

        t0 = time.time()
        func = partial(self.performCalc,maxDistance)
        results = x.map(func,chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1-t0)

    def performCalc(self,maxDistance,chunk):

        cdef int area
        cdef double[:,:] gPoints
        gPoints = memoryview(chunk)
        for i in range(0,len(gPoints)):
            with nogil:
                area =  self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self,double[:] p) nogil :
        cdef int area
        area = self.rect.getArea()
        return area

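【Comments on the question】:

In performCalc, you should type i and replace len(gPoints) with gPoints.shape[0]. That should let you make the whole loop nogil.
@DavidW Thanks for helping me. I assume the with nogil still stays inside the for loop? I did this and I get a "Coercion from Python not allowed without the GIL" compile error.
I think the nogil should be able to go outside the loop (and I'd expect that to be a big improvement). I suspect it can be made to work, but it sounds like I need to take a proper look.
You accidentally dropped the range!
@DavidW That does indeed work. I am now running into other issues mixing Cython and Python, but I will raise those as separate questions if needed. If you like, you can write it up and I will accept it.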

Tags: python-3.x multithreading performance cython gil

【Solution 1】:

My suggestion (in the comments) was to make sure the whole performCalc loop is nogil. A few changes are needed for that:

cdef Py_ssize_t i # set type of "i" (although Cython can possibly deduce this anyway)
with nogil:
    for i in range(0,gPoints.shape[0]):
        area =  self.getArea2(gPoints[i])

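The most important of these is replacing len(gPoints) with gPoints.shape[0], which swaps a call to a Python function for an array lookup (personally, I don't think len makes sense for a two-dimensional array anyway).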
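
As a rough illustration in plain Python (not Cython), the standard-library memoryview suggests why the substitution does not change the loop bounds: for a two-dimensional buffer, len() reports only the size of the first dimension, which is the same number as shape[0]. The 2x3 byte buffer and the variable names are made up for this example.

```python
# Hypothetical 2x3 buffer of bytes, standing in for a chunk of grid points.
raw = bytearray(range(6))
points = memoryview(raw).cast("B", shape=[2, 3])

n_rows = points.shape[0]  # attribute lookup on the buffer's shape tuple
n_cols = points.shape[1]

# len() of a multi-dimensional memoryview is the size of the first dimension only.
assert len(points) == n_rows

print(n_rows, n_cols)
```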

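Essentially, acquiring and releasing the GIL has a cost. You want to make sure that the work done without the GIL is worth the time spent handling it. Computing the area of a rectangle is trivial (two subtractions and one multiplication), so it does not really justify the time spent coordinating the GIL between threads. Remember that, once per loop iteration, every thread has to (briefly) hold the GIL, and during that time no other thread can hold it. With the whole loop nogil, however, the time spent managing it becomes small.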
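
A loose analogy in plain Python, not the Cython code from the answer: an ordinary threading.Lock stands in for the GIL, and a counter records how often the workers have to take it. CountingLock, per_iteration, per_chunk, run and the rectangle data are all hypothetical names invented for this sketch. It only counts lock handoffs and makes no claim about timings.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingLock:
    """A lock that counts acquisitions; a hypothetical stand-in for the GIL."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # only modified while the lock is held
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def area(rect):
    x0, y0, x1, y1 = rect
    return (x1 - x0) * (y1 - y0)  # two subtractions and one multiplication


def per_iteration(lock, rects):
    """Mimics 'with nogil' inside the loop: the lock is taken once per item."""
    total = 0
    for rect in rects:
        with lock:
            pass  # stands for the per-iteration bookkeeping that needs the lock
        total += area(rect)  # the work done without the lock
    return total


def per_chunk(lock, rects):
    """Mimics the whole loop being nogil: the lock is taken once per chunk."""
    with lock:
        pass
    return sum(area(rect) for rect in rects)


def run(worker, chunks):
    """Run one worker per chunk and report (total area, lock acquisitions)."""
    lock = CountingLock()
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        totals = list(pool.map(lambda chunk: worker(lock, chunk), chunks))
    return sum(totals), lock.acquisitions


# Made-up data: 4000 small rectangles split into 4 chunks of 1000.
rects = [(0, 0, i % 7 + 1, i % 5 + 1) for i in range(4000)]
chunks = [rects[k::4] for k in range(4)]

if __name__ == "__main__":
    print(run(per_iteration, chunks))
    print(run(per_chunk, chunks))
```

With 4 chunks of 1,000 rectangles each, the first variant takes the shared lock 4,000 times and the second only 4 times, while both return the same total area. Every one of those acquisitions is a point where the threads have to take turns.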

【Discussion】: