np.tile 的 Numba 兼容实现？答案

【问题标题】：Numba-compatible implementation of np.tile?np.tile 的 Numba 兼容实现？
【发布时间】：2020-05-08 19:04:58
【问题描述】：

我正在编写一些用于去雾图像的代码 based on this paper，我从一个废弃的 Py2.7 implementation 开始。从那时起，尤其是使用 Numba，我已经取得了一些真正的性能改进（很重要，因为我必须在 8K 图像上运行它）。

我非常确信我的最后一个重大性能瓶颈是在执行 box filter step 时（我已经为每张图像缩短了近一分钟，但最后一个缓慢的步骤是 ~30 秒/图像），我已经接近了让它在 Numba 中以 nopython 的身份运行：

@njit # Row dependencies means can't be parallel
def yCumSum(a):
    """
    Numba based computation of y-direction
    cumulative sum. Can't be parallel!
    """
    out = np.empty_like(a)
    out[0, :] = a[0, :]
    for i in prange(1, a.shape[0]):
        out[i, :] = a[i, :] + out[i - 1, :]
    return out

@njit(parallel= True)
def xCumSum(a):
    """
    Numba-based parallel computation
    of X-direction cumulative sum
    """
    out = np.empty_like(a)
    for i in prange(a.shape[0]):
        out[i, :] = np.cumsum(a[i, :])
    return out

@jit
def _boxFilter(m, r, gpu= hasGPU):
    if gpu:
        m = cp.asnumpy(m)
    out = __boxfilter__(m, r)
    if gpu:
        return cp.asarray(out)
    return out

@jit(fastmath= True)
def __boxfilter__(m, r):
    """
    Fast box filtering implementation, O(1) time.
    Parameters
    ----------
    m:  a 2-D matrix data normalized to [0.0, 1.0]
    r:  radius of the window considered
    Return
    -----------
    The filtered matrix m'.
    """
    #H: height, W: width
    H, W = m.shape
    #the output matrix m'
    mp = np.empty(m.shape)

    #cumulative sum over y axis
    ySum = yCumSum(m) #np.cumsum(m, axis=0)
    #copy the accumulated values of the windows in y
    mp[0:r+1,: ] = ySum[r:(2*r)+1,: ]
    #differences in y axis
    mp[r+1:H-r,: ] = ySum[(2*r)+1:,: ] - ySum[ :H-(2*r)-1,: ]
    mp[(-r):,: ] = np.tile(ySum[-1,: ], (r, 1)) - ySum[H-(2*r)-1:H-r-1,: ]

    #cumulative sum over x axis
    xSum = xCumSum(mp) #np.cumsum(mp, axis=1)
    #copy the accumulated values of the windows in x
    mp[:, 0:r+1] = xSum[:, r:(2*r)+1]
    #difference over x axis
    mp[:, r+1:W-r] = xSum[:, (2*r)+1: ] - xSum[:, :W-(2*r)-1]
    mp[:, -r: ] = np.tile(xSum[:, -1][:, None], (1, r)) - xSum[:, W-(2*r)-1:W-r-1]
    return mp

在边缘有很多事情要做，但是如果我可以将 tile 操作作为 nopython 调用，我可以 nopython 整个 boxfilter 步骤并获得很大的性能提升。我不太愿意做一些非常具体的事情，因为我很想在其他地方重用这段代码，但我不会特别反对将它限制在 2D 范围内。无论出于何种原因，我只是盯着这个看，并不确定从哪里开始。

【问题讨论】：

你看过tile代码吗？它使用已编译的np.repeat
您可以通过将数组分配分成几部分来避免tile。顺便说一句，您似乎正在使用tile 扩展一个Sum 切片以与另一个切片一起使用，然后将其分配给mp。我还没有花时间弄清楚细节。
谢谢，我没有意识到 tile 在后台使用 repeat；几个小时后我去看看
你根本不需要瓷砖。只需在没有任何矢量化命令的情况下编写循环即可。这也将导致您的代码加速。（除了 BLAS 调用之外的所有矢量化命令都被转换为（多个）循环。Numba 试图简化这一点，但人类程序员通常在执行这种循环连接方面要好得多。

标签： python numpy numba

【解决方案1】：

np.tile 是要完全重新实现的 bit too complicated，但除非我读错了，否则您似乎只需要获取一个向量，然后沿着不同的轴重复它 r 次。

一种兼容 Numba 的方法是编写

y = x.repeat(r).reshape((-1, r))

那么x会沿着第二个维度重复r次，这样y[i, j] == x[i]。

例子：

In [2]: x = np.arange(5)                                                                                                

In [3]: x.repeat(3).reshape((-1, 3))                                                                                                                                  
Out[3]: 
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])

如果您希望 x 沿第一个维度重复，只需转置 y.T。

【讨论】：

完美，谢谢。不得不做一些copy-slice-reshape fixes 但它确实成功了！
完美！感谢您的回答。如果一维数组需要 np.tile 的确切功能，只需执行以下操作： np.repeat([1,2,3,4], 3).reshape(-1, 3).T.flatten() >> > 数组([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])