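Remove for loops for faster execution - vectorize (answers)

【Question title】: Remove for loops for faster execution - vectorize
【Posted】: 2021-06-11 08:14:51
【Question】:

As part of my academic project, I am working on a linear filter for images. Below is the code, which uses only NumPy (no external libraries). I would like to eliminate the for loops through vectorization or any other option. How can I vectorize this for faster execution? Thanks for your help.

Inputs -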

Image.shape - (568, 768)
weightArray.shape - (3, 3)

    def apply_filter(image: np.array, weight_array: np.array) -> np.array:
        rows, cols = image.shape
        height, width = weight_array.shape
        output = np.zeros((rows - height + 1, cols - width + 1))
    
        for rrow in range(rows - height + 1):
            for ccolumn in range(cols - width + 1):
                for hheight in range(height):
                    for wwidth in range(width):
                        imgval = image[rrow + hheight, ccolumn + wwidth]
                        filterval = weight_array[hheight, wwidth]
                        output[rrow, ccolumn] += imgval * filterval
                        
        return output

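【Comments on the question】:

Welcome to StackOverflow @parvy. Please create a Minimal and Reproducible Example. Add runnable code along with some sample data.
Is SciPy off limits? I'm pretty sure this is built into SciPy.
You are probably looking for scipy.signal.convolve2d. Vectorizing your code is certainly possible, but unless you are interested in it for educational purposes, I would recommend using scipy.
Why numpy but not scipy? They are part of the same project.
So this is a homework question? Then you should show what you have tried so far and where you got stuck.

Tags: python python-3.x numpy vectorization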

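【Solution 1】:

Vectorization is the process of converting each explicit for loop into a 1-dimensional array operation. In Python, this involves re-imagining your data in terms of slices.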
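As a small aside (a toy illustration, not part of the answer's own code), the idea of re-imagining a loop as slices can be sketched on a 1-D array. The array `a` and the names `looped` and `sliced` are made up for this example.

```python
import numpy as np

a = np.arange(10, dtype=float)

# Explicit loop: each output element is the sum of two neighbouring inputs.
looped = np.zeros(len(a) - 1)
for i in range(len(a) - 1):
    looped[i] = a[i] + a[i + 1]

# The same computation expressed with two shifted slices of the same array.
sliced = a[:-1] + a[1:]

assert np.array_equal(looped, sliced)
```

The loop index `i` disappears: `a[:-1]` plays the role of `a[i]` and `a[1:]` the role of `a[i + 1]` for every `i` at once.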

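In the code below, I provide a working vectorization of the kernel loop. It shows how to approach vectorization, but because it only optimizes the 3x3 array, it won't give you the largest gains.

If you want to see a bigger improvement, vectorize over the image array instead. I have templated that for you as well, but left part of it as an exercise.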

import numpy as np
from PIL import Image

## no vectorization
def applyFilterMethod1(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for rrow in range(rows - height + 1):
        for ccolumn in range(cols - width + 1):
            for hheight in range(height):
                for wwidth in range(width):
                    imgval = image[rrow + hheight, ccolumn + wwidth]
                    filterval = weightArray[hheight, wwidth]
                    output[rrow, ccolumn] += imgval * filterval
                    
    return output

## vectorize the kernel loop (~3x improvement)
def applyFilterMethod2(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for rrow in range(rows - height + 1):
        for ccolumn in range(cols - width + 1):
            imgval = image[rrow:rrow + height, ccolumn:ccolumn + width]
            filterval = weightArray[:, :]
            output[rrow, ccolumn] = np.sum(imgval * filterval)
                    
    return output

## vectorize the image loop (~50x improvement)
def applyFilterMethod3(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for hheight in range(height):
        for wwidth in range(width):
            imgval = 0 ## TODO -- construct a compatible slice
            filterval = weightArray[hheight, wwidth]
            output[:, :] += imgval * filterval
                    
    return output

src = Image.open("input.png").convert("L") ## grayscale only
sb = np.asarray(src)
cb = np.array([[1,2,1],[2,4,2],[1,2,1]])
cb = cb / cb.sum() ## normalize

db = applyFilterMethod2(sb, cb)

dst = Image.fromarray(db)
dst.convert("L").save("output.png")
#src.show() ; dst.show()

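Note: You could probably remove all four for loops, at the cost of some added complexity. However, because that would only eliminate the overhead of 9 iterations (in this example), I don't expect it to yield any additional performance gain over applyFilterMethod3. Moreover, although I haven't tried it, the way I imagine doing it might add more overhead than it removes.

FYI: This is a standard image convolution (only grayscale is supported as implemented). I always like to point out that, to be mathematically correct, this needs to compensate for the gamma compression implicit in nearly every default image encoding, but that little detail is often ignored.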
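One way to do that compensation can be sketched as follows, assuming the image is 8-bit and sRGB-encoded (an assumption; the answer does not say which encoding is in use). The helper names `srgb_to_linear`, `linear_to_srgb` and `filter_in_linear_light` are hypothetical, and `filter_fn` stands for any filter function with the same `(image, weights)` signature as the ones above.

```python
import numpy as np

def srgb_to_linear(encoded):
    """Decode sRGB-encoded values in [0, 1] to linear light."""
    encoded = np.asarray(encoded, dtype=float)
    return np.where(encoded <= 0.04045,
                    encoded / 12.92,
                    ((encoded + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(linear):
    """Encode linear-light values in [0, 1] back to sRGB."""
    linear = np.clip(np.asarray(linear, dtype=float), 0.0, None)
    return np.where(linear <= 0.0031308,
                    linear * 12.92,
                    1.055 * linear ** (1 / 2.4) - 0.055)

def filter_in_linear_light(image_u8, weights, filter_fn):
    """Decode, filter in linear light, then re-encode to 8 bits."""
    linear = srgb_to_linear(image_u8 / 255.0)
    filtered = filter_fn(linear, weights)
    return np.rint(linear_to_srgb(filtered) * 255.0).astype(np.uint8)
```

With this wrapper the weighted sums are taken over light intensities rather than over encoded pixel values, which is the detail the note above refers to.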

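Discussion

This type of vectorization is often necessary in Python specifically, because the standard Python interpreter is extremely inefficient at processing large for loops. Explicitly iterating over every pixel of an image therefore wastes a lot of time. Ultimately, though, the vectorized implementation does not change the amount of real work performed, so we are only talking about eliminating the overhead aspect of the algorithm.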
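The overhead can be observed with a rough timing sketch like the one below. It uses a plain sum of products rather than the image filter, and the array size and repeat count are arbitrary choices for illustration. Both functions perform the same number of multiply-adds, so any difference in the timings is interpreter overhead.

```python
import timeit
import numpy as np

def dot_loop(x, y):
    """Sum of products with an explicit Python-level loop."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total

def dot_vectorized(x, y):
    """The same sum of products as a single array operation."""
    return float(np.dot(x, y))

rng = np.random.default_rng(0)
x = rng.random(100_000)
y = rng.random(100_000)

# Both versions compute the same value, up to floating-point rounding.
assert np.isclose(dot_loop(x, y), dot_vectorized(x, y))

t_loop = timeit.timeit(lambda: dot_loop(x, y), number=3)
t_vec = timeit.timeit(lambda: dot_vectorized(x, y), number=3)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.6f} s")
```

The exact numbers depend on the machine and the NumPy build, so none are quoted here.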

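Vectorization has another benefit, though: parallelization. Lumping a large amount of data processing onto a single operator gives the language/library more flexibility in how to optimize the execution. That might include executing your embarrassingly parallel operation on a GPU, if you have the right tools, for example the Tensorflow image module.

Python's seamless support for array programming is one reason it has become so popular in machine learning, which can be very compute-intensive.

Solution

Here is the solution for the imgval assignment, which was left as an exercise above.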

imgval = image[hheight:hheight+rows - height+1, wwidth:wwidth+cols - width +1]
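For reference, here is a sketch that drops that slice into the applyFilterMethod3 template and checks the result against the four-loop version on a small random image. The function names, the random test data and the 2x3 kernel are chosen for this example only.

```python
import numpy as np

def apply_filter_reference(image, weights):
    """Four nested loops, as in the question."""
    rows, cols = image.shape
    height, width = weights.shape
    output = np.zeros((rows - height + 1, cols - width + 1))
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            for h in range(height):
                for w in range(width):
                    output[r, c] += image[r + h, c + w] * weights[h, w]
    return output

def apply_filter_image_vectorized(image, weights):
    """Loop over the kernel only; each step adds one shifted, weighted image slice."""
    rows, cols = image.shape
    height, width = weights.shape
    out_rows, out_cols = rows - height + 1, cols - width + 1
    output = np.zeros((out_rows, out_cols))
    for h in range(height):
        for w in range(width):
            # Same slice as the imgval line above, written with out_rows/out_cols.
            output += image[h:h + out_rows, w:w + out_cols] * weights[h, w]
    return output

rng = np.random.default_rng(0)
image = rng.random((7, 9))
weights = rng.random((2, 3))
assert np.allclose(apply_filter_reference(image, weights),
                   apply_filter_image_vectorized(image, weights))
```

Each slice has exactly the shape of the output, which is why the in-place addition works without any broadcasting tricks.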

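【Comments】:

Because this looks like a homework question, I have left the solution incomplete. I will come back later and fill in the TODO part. Hint 1: It has the same general form as applyFilterMethod2, but the translation is not trivial. Hint 2: If you get it wrong, the Python interpreter should give you detailed information about the incompatibility of the slices in the array operation.
While this might be someone's homework, the concepts involved may be relevant to future visitors. The vectorizations in Method2 and Method3 yield 3x and 50x performance gains, respectively.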
What is “vectorization”?
What are vectors and how are they used in programming?

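【Solution 2】:

You can construct an array of sliced views of the image, each shifted by an index of the weight array, then multiply it by the weights and sum.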

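import numpy as np

def apply_filter(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    rows, cols = image.shape
    height, width = weights.shape
    out_rows, out_cols = rows - height + 1, cols - width + 1  # "valid" output size
    indices = np.indices(weights.shape).T.reshape(weights.size, 2)  # kernel offsets (r, c)
    # One shifted slice per kernel offset, each the size of the output.
    views = np.array([image[r:r + out_rows, c:c + out_cols] for r, c in indices])
    return np.inner(views.T, weights.T.flatten()).T  # sum product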

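(I had to transpose and reshape at several points to get the data into the desired shape and order. There may be a simpler way.)

There is still a sneaky for loop in the form of a list comprehension over the weight indices, but the work inside that loop is reduced to creating a set of slice views. The loop could be avoided with sliding_window_view, though it is unclear whether that would improve performance, or with stride_tricks.as_strided (see the answers to this question).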
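A minimal sketch of the sliding_window_view variant, assuming NumPy 1.20 or later (the version that introduced `numpy.lib.stride_tricks.sliding_window_view`). The function name `apply_filter_windows` and the use of `np.einsum` for the weighted sum are choices made for this example, and no claim is made about its speed relative to the versions above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy 1.20+

def apply_filter_windows(image, weights):
    """Weighted sum over every kernel-sized window, with no Python-level loop."""
    # windows has shape (out_rows, out_cols, height, width) and is a read-only view.
    windows = sliding_window_view(image, weights.shape)
    # Multiply each window by the weights and sum over the last two axes.
    return np.einsum("ijkl,kl->ij", windows, weights)

image = np.arange(20, dtype=float).reshape(4, 5)
weights = np.ones((2, 2))
result = apply_filter_windows(image, weights)
assert result.shape == (3, 4)
assert result[0, 0] == 0 + 1 + 5 + 6
```

The window array is a strided view, so building it copies no image data. The multiply-and-sum inside `einsum` still touches every window element.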

【Comments】: