Enabling Parallelism with Cython: Answers

【Question title】: Enabling Parallelism with Cython
【Posted】: 2018-04-09 04:36:42
【Question】:

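I am trying to get the prange function from Cython's parallel package to work, but no parallelism seems to take effect. To have an MWE, I took the example code from the book Cython: A Guide for Python Programmers and modified it slightly by adding a few print statements. The example code is freely available on github; the code I am referring to is located at: examples/12-parallel-cython/02-prange-parallel-loops/.

Below is my modified version of the julia.pyx file.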

# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp

from cython cimport boundscheck, wraparound
from cython cimport parallel

import numpy as np

cdef inline double norm2(double complex z) nogil:
    return z.real * z.real + z.imag * z.imag


cdef int escape(double complex z,
                double complex c,
                double z_max,
                int n_max) nogil:

    cdef:
        int i = 0
        double z_max2 = z_max * z_max

    while norm2(z) < z_max2 and i < n_max:
        z = z * z + c
        i += 1

    return i


@boundscheck(False)
@wraparound(False)
def calc_julia(int resolution, double complex c,
               double bound=1.5, double z_max=4.0, int n_max=1000):

    cdef:
        double step = 2.0 * bound / resolution
        int i, j
        double complex z
        double real, imag
        int[:, ::1] counts

    counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)

    for i in parallel.prange(resolution + 1, nogil=True,
                    schedule='static', chunksize=1):
        real = -bound + i * step
        for j in range(resolution + 1):
            imag = -bound + j * step
            z = real + imag * 1j
            counts[i,j] = escape(z, c, z_max, n_max)

    return np.asarray(counts)

@boundscheck(False)
@wraparound(False)
def julia_fraction(int[:,::1] counts, int maxval=1000):
    cdef:
        unsigned int thread_id
        int total = 0
        int i, j, N, M
    N = counts.shape[0]; M = counts.shape[1]
    print("N = %d" % N)
    with nogil:
        for i in parallel.prange(N, schedule="static", chunksize=10):
            thread_id = parallel.threadid()
            with gil:
                print("Thread %d." % (thread_id))
            for j in range(M):
                if counts[i,j] == maxval:
                    total += 1
    return total / float(counts.size)

When I compile it using the following setup_julia.py

from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension

setup(name="julia",
      ext_modules=cythonize(Extension('julia', ['julia.pyx'], extra_compile_args=['-fopenmp'], extra_link_args=['-fopenmp'])))

with the command

python setup_julia.py build_ext --inplace

and then run the run_julia.py file, I see that every iteration of the for loop uses a single thread, Thread 0. The terminal output is shown below.

poulin8:02-prange-parallel-loops poulingroup$ python run_julia.py 
time: 0.892143
julia fraction: N = 1001
Thread 0.
Thread 0.
Thread 0.
Thread 0.
.
.
.
.
Thread 0.
0.236994773458
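
For comparison, here is a minimal pure-Python sketch of how the iterations would be spread out if the loop really ran in parallel. It models the OpenMP rule for schedule="static" with a chunk size: chunks of consecutive iterations are handed to threads round-robin, in order of thread number. N = 1001 and chunksize = 10 come from the code above; the thread count of 4 and the helper names static_owner and iterations_per_thread are assumptions made for illustration only.

```python
from collections import Counter


def static_owner(i, chunksize, num_threads):
    """Thread that runs iteration i under schedule(static, chunksize)."""
    # Chunks of `chunksize` consecutive iterations are assigned
    # round-robin, in order of thread number.
    return (i // chunksize) % num_threads


def iterations_per_thread(n, chunksize, num_threads):
    """Count how many of the n iterations each thread would execute."""
    return Counter(static_owner(i, chunksize, num_threads) for i in range(n))


if __name__ == "__main__":
    # N and chunksize are taken from the question; 4 threads is assumed.
    counts = iterations_per_thread(1001, 10, 4)
    for tid in sorted(counts):
        print("Thread %d: %d iterations" % (tid, counts[tid]))
```

With four threads this model assigns 251 iterations to thread 0 and 250 to each of threads 1 to 3. With a single thread every iteration maps to thread 0, which is the pattern in the output above.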

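As far as I can tell, the for loop is simply running serially. Can someone show me how to get the for loop to distribute the load across multiple threads? I also tried setting the environment variable OMP_NUM_THREADS to a number greater than 1, but it had no effect.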
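
One way to rule out a shell-level mistake when setting OMP_NUM_THREADS is to launch the script from a small Python wrapper that sets the variable explicitly in the child's environment. This is only a sketch: it assumes Python 3.7 or later (the question uses 2.7), and the helper name run_with_omp_threads is hypothetical. Note that the variable can only matter if the extension was actually built with OpenMP enabled; without it, Cython compiles prange as an ordinary serial loop.

```python
import os
import subprocess
import sys


def run_with_omp_threads(script_args, num_threads):
    """Run a Python command in a child process with OMP_NUM_THREADS set.

    Returns whatever the child wrote to stdout.
    """
    env = dict(os.environ)                    # copy, so the parent is untouched
    env["OMP_NUM_THREADS"] = str(num_threads)
    result = subprocess.run(
        [sys.executable] + list(script_args),
        env=env,
        capture_output=True,
        text=True,
        check=True,                           # raise if the child exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # Here the child simply echoes the variable it received; in practice
    # the argument list would be ["run_julia.py"].
    probe = "import os; print(os.environ['OMP_NUM_THREADS'])"
    print(run_with_omp_threads(["-c", probe], 4))
```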

I am running the tests on OSX 10.11.6 with Python 2.7.10 and gcc 5.2.0.

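【Comments】:

You are reacquiring the GIL on every iteration of the parallel loop. I think that is killing the loop's parallelism. Try reacquiring the GIL only every N iterations of the parallel loop, with N large enough that the loop gets a substantial amount of work done between acquisitions.
@ngoldbaum It should still work (i.e. run on multiple threads), just less efficiently than without the with gil: print...
Try fprintf("Thread %d\n", thread_id) instead of the Python print. fprintf can be found in stdio.h.
@danny That would be:
    %%cython
    from libc.stdio cimport FILE, stdout, fprintf
    fprintf(stdout, "%d\n", <int>thread_id)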

Tags: multithreading openmp cython cythonize

【Solution 1】:

I ran into the same problem on Windows 7: the loop was running serially. Note the compiler message:

python setup_julia.py build_ext --inplace

cl : Command line warning D9002 : ignoring unknown option '-fopenmp'

Apparently, with Visual Studio the option has to be -openmp

# distutils: extra_compile_args = -openmp
# distutils: extra_link_args = -openmp

Now it runs in parallel.
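
To keep one build script working with both compilers, the flag can be chosen at build time. The following is a minimal sketch, not a definitive recipe: the helper name openmp_flags is hypothetical, and it assumes that a Windows build uses Visual Studio's cl (which accepts /openmp as well as -openmp and, as far as I know, needs no separate link flag) while every other platform uses a gcc-compatible compiler that accepts -fopenmp. A MinGW build on Windows would need the gcc branch instead.

```python
import sys


def openmp_flags(platform=None):
    """Return (extra_compile_args, extra_link_args) that enable OpenMP.

    Assumption: Windows builds use MSVC; all other platforms use a
    gcc-compatible compiler.
    """
    if platform is None:
        platform = sys.platform
    if platform.startswith("win"):
        # MSVC enables OpenMP at compile time only.
        return ["/openmp"], []
    # gcc needs the flag for both compiling and linking.
    return ["-fopenmp"], ["-fopenmp"]


if __name__ == "__main__":
    compile_args, link_args = openmp_flags()
    print(compile_args, link_args)
    # In setup_julia.py these would be passed on as:
    #   Extension('julia', ['julia.pyx'],
    #             extra_compile_args=compile_args,
    #             extra_link_args=link_args)
```

If the flags are supplied this way from the setup script, the two # distutils: comment lines at the top of julia.pyx become redundant.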

As @danny said, you can use fprintf:

from cython.parallel cimport prange, threadid
from libc.stdio cimport stdout, fprintf

def julia_fraction(int[:,::1] counts, int maxval=1000):
   ...
   thread_id = threadid()
   fprintf(stdout, "%d\n", <int>thread_id)
   ...

【Comments】: