加速我的 Python/Cython 代码的提示答案

【问题标题】：Tips for speeding up my Python/Cython code加速我的 Python/Cython 代码的提示
【发布时间】：2017-11-23 05:32:55
【问题描述】：

我编写了一个 Python 程序并对其进行 Cythonize。 Cython (30%) 获得的加速效果并不令人满意。通过更改代码结构或 Cythonized 方式，肯定有优化它的空间。

该程序基本上是采用数字高程模型（DEM）栅格图和具有相同形状的过量水图。对于溢出水图中的每个像素，它会搜索四个相邻像素，并确定该像素是否低于周围的邻居、具有相同的水平或高于周围的邻居。基于此，它会增加像素处的水位或将其多余的水分配给具有较低海拔的邻居。代码继续执行，直到所有多余的水都散布在地面上。这是代码的 Cython 版本。

import numpy as np
cimport numpy as np

cdef unravel(np.ndarray[np.int_t, ndim = 1] idx,int shape0, int shape1):
    return idx//shape0, idx%shape1

cdef find_lower_neighbours(int i, int j, np.ndarray[np.double_t, ndim=2] water_level, double friction_head_loss):

    cdef double current_water_el, minlevels, deltav_total, deltav_min
    current_water_el = water_level[i,j]
    cdef np.ndarray[np.double_t, ndim = 1] levels = np.zeros(4, dtype = np.double)

    levels[:] = water_level[i - 1, j], water_level[i, j + 1], water_level[i + 1, j], water_level[i, j - 1]
    minlevels = levels.min()

    if current_water_el - minlevels < 0:
        return 0, minlevels, 0

    elif np.absolute(current_water_el - minlevels) < 0.0001:
        res = np.where(np.absolute(levels - current_water_el) < 0.0001)
        return 1, res[0], 0

    else:
        levels = current_water_el - levels
        low_values_flags = levels < 0
        levels[low_values_flags] = 0

        deltav_total = np.sum(levels)
        deltav_min = levels[levels > 0].min()
        return 2, levels / (deltav_total + deltav_min), deltav_min / (deltav_total + deltav_min)


cpdef np.ndarray[np.double_t, ndim=2] new_algorithm( np.ndarray[np.double_t, ndim=2] DEM, np.ndarray[np.double_t, ndim=2] extra_volume_map, double nodata, double pixel_area, double friction_head_loss , int von_neuman):

    cdef int terminate = 0
    cdef int iteration = 1

    cdef double sum_extra_volume_map


    index_dic_von_neuman = [[-1, 0], [0, 1], [1, 0], [0, -1]]
    cdef int DEMshape0 = DEM.shape[0]
    cdef int DEMshape1 = DEM.shape[1]

    cdef np.ndarray[np.double_t, ndim = 2] water_levels
    water_levels = np.copy(DEM)

    cdef np.ndarray[np.double_t, ndim = 2] temp_water_levels
    temp_water_levels = np.copy(water_levels)

    cdef np.ndarray[np.double_t, ndim = 2] temp_extra_volume_map
    temp_extra_volume_map = np.copy(extra_volume_map)


    cdef np.ndarray[np.int_t, ndim = 2] wetcells
    wetcells = np.zeros((DEMshape0,DEMshape1), dtype= np.int)
    wetcells[extra_volume_map > 0] = 1

    cdef np.ndarray[np.int_t, ndim = 2] temp_wetcells
    temp_wetcells = np.zeros((DEMshape0,DEMshape1), dtype= np.int )

    cdef double min_excess = friction_head_loss * pixel_area
    cdef np.ndarray[np.int_t, ndim = 1] fdx
    cdef int i, j , k, condition, have_any_dry_cells_in_neghbors
    cdef double water_level_difference, n , volume_to_each_neghbour, weight, w0

    if von_neuman == 0:
        index_dic = index_dic_von_neuman


    while terminate != 1:
        fdx = np.flatnonzero(extra_volume_map > min_excess)
        extra_volume_locations = unravel(fdx, DEMshape1,DEMshape1)
        if not extra_volume_locations[0].size:
            terminate = 1
            return water_levels

        for item in zip(*extra_volume_locations):
            i = item[0]
            j = item[1]

            if DEM[i,j] == nodata:
                print "warningggggg", i, j
                temp_extra_volume_map[i,j] = 0.
            else:
                condition, wi, w0 = find_lower_neighbours(i, j, water_levels, friction_head_loss)

                if condition == 0:
                    water_level_difference = wi - water_levels[i,j]
                    temp_water_levels[i,j] = wi
                    temp_extra_volume_map[i,j] -= water_level_difference * pixel_area

                elif condition == 1:
                    n = len(wi)
                    volume_to_each_neghbour = (extra_volume_map[i, j] - friction_head_loss * pixel_area * .02)/ (n * 1. )
                    for itemm in wi:
                        temp_extra_volume_map[i + index_dic[itemm][0], j + index_dic[itemm][1]] += volume_to_each_neghbour
                        temp_wetcells[i + index_dic[itemm][0], j + index_dic[itemm][1]] += 1
                    temp_extra_volume_map[i,j] -= extra_volume_map[i,j]
                    temp_water_levels[i,j] += friction_head_loss * .02

                elif condition == 2:
                    temp_wetcells[i,j] += 1
                    have_any_dry_cells_in_neghbors = 0 # means "no"
                    for k, weight in enumerate(wi):
                        if weight > 0:
                            temp_extra_volume_map[i + index_dic[k][0], j + index_dic[k][1]] += weight * (extra_volume_map[i,j])
                            if wetcells[i + index_dic[k][0], j + index_dic[k][1]] == 0:
                                have_any_dry_cells_in_neghbors = 1
                            temp_wetcells[i + index_dic[k][0], j + index_dic[k][1]] += 1
                    if have_any_dry_cells_in_neghbors == 1:
                        temp_water_levels[i,j] += friction_head_loss
                        temp_extra_volume_map[i, j] -= (1. - w0) * (extra_volume_map[i, j])
                    else:
                        temp_extra_volume_map[i, j] -= (1. - w0) * (extra_volume_map[i, j])


        wetcells = np.copy(temp_wetcells)
        water_levels = np.copy(temp_water_levels)
        extra_volume_map = np.copy(temp_extra_volume_map)

        iteration += 1
        if iteration*1. %500. == 0.:
            sum_extra_volume_map = np.sum(extra_volume_map)
            print "iteration", iteration, "volume left =",  sum_extra_volume_map


    print "finished at iteration =", iteration - 1

    return water_levels

这是分析的结果：

6712150 function calls in 29.005 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   29.005   29.005 <string>:1(<module>)
2017444    0.497    0.000    4.414    0.000 _methods.py:28(_amin)
289757    0.070    0.000    0.543    0.000 _methods.py:31(_sum)
289757    0.437    0.000    1.098    0.000 fromnumeric.py:1730(sum)
506055    0.203    0.000    0.705    0.000 function_base.py:1453(copy)
168685    0.140    0.000    0.300    0.000 numeric.py:859(flatnonzero)
     1    0.000    0.000    0.000    0.000 test.py:22(print_dem_for_excel)
     1    0.000    0.000   29.005   29.005 test.py:27(test)
289757    0.119    0.000    0.119    0.000 {isinstance}
     1    0.000    0.000    0.000    0.000 {len}
    20    0.000    0.000    0.000    0.000 {map}
    20    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    20    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
168685    0.106    0.000    0.106    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
168685    0.054    0.000    0.054    0.000 {method 'ravel' of 'numpy.ndarray' objects}
2307201    4.390    0.000    4.390    0.000 {method 'reduce' of 'numpy.ufunc' objects}
 506056    0.501    0.000    0.501    0.000 {numpy.core.multiarray.array}
     1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.zeros}
     1    0.000    0.000    0.000    0.000 {time.time}
     1   22.488   22.488   29.005   29.005 {version3Cython.new_algorithm}

【问题讨论】：

不要指望 SO 做你的工作.....
如果我的问题被这样解释，我很抱歉。我认为通过放置整个算法，专业人士可以看到我什至不知道可以改进的领域。
cython -a 是一个不错的首选。像无类型的index_dic_von_neuman = ... 这样的东西会影响你的表现。
@Veedrac 我的代码中的无类型项目是 Python 对象（列表、字典等）我应该尽量避免使用它们还是有特定的方法来声明它们的类型？我在解释 -a html 文件时遇到问题，我应该学习如何使用它。感谢您的提示。

标签： python performance cython cellular-automata

【解决方案1】：

您需要首先profile您的程序并精确测量需要时间的内容（哪些函数消耗的 CPU 时间最多）。

您可以改进程序中的算法并降低某些函数的time complexity。

您还可以手动重写代码的繁重（资源需求，例如 CPU 消耗）部分，例如如Python extensions in（手写）C（也许还有 C++ 中的某些部分或任何易于从 C 调用并与 Python 和 C 的内存模型兼容的语言）。不要指望任何工具（例如 Cython）能够自动高效地做到这一点。

您的程序足够小。考虑花几个星期（或几个月）重写它的某些部分，甚至重新设计和完全重写它。我想静态类型的编译语言（如 Ocaml、Go、D、C++）可以提供一些性能改进。也许您可能会考虑完全重新设计您的代码作为并发应用程序（multi-threaded，或OpenCL，MPI，...）

将现有的 Python 原型视为开始理解您正在解决的问题的一种方式，但要完全重新设计和重写您的代码（可能使用不同的语言）。花一些时间阅读有关该主题的现有文献并与专家讨论。

顺便说一句，您的用户可能会认为不到一分钟的执行时间是可以接受的。然后你不需要重写代码。请记住，您的开发时间也有一些成本！有No Silver Bullet...

【讨论】：

真实案例的运行时间大约需要 3 小时。另外我会同时在很多情况下运行这个算法，所以我可以占用我所有的计算资源。
您决定是否值得花一个月的工作（甚至可能几个月，取决于您当前的技能）是否值得。记住Hofstadter's law。也许和你的博士导师谈谈。
@BasileStarynkevitch 很好的答案
关于分析，Cython 确实支持逐行分析（这可能会提供额外的洞察力）
谢谢@DavidW，不知道。我试试看。