【Question title】: Simple GLSL convolution shader is atrociously slow
【Posted】: 2012-09-18 03:27:17
【Question】:

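I'm trying to implement a 2D outline shader in OpenGL ES 2.0 for iOS. It is insanely slow, as in 5 fps slow. I've tracked it down to the texture2D() calls. However, without those, no convolution shader is doable. I've tried using lowp instead of mediump, but then everything is just black. It does give another 5 fps, but that is still unusable.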

Here is my fragment shader.

    varying mediump vec4 colorVarying;
    varying mediump vec2 texCoord;

    uniform bool enableTexture;
    uniform sampler2D texture;

    uniform mediump float k;

    void main() {

        const mediump float step_w = 3.0/128.0;
        const mediump float step_h = 3.0/128.0;
        const mediump vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
        const mediump vec4 one = vec4(1.0, 1.0, 1.0, 1.0);

        mediump vec2 offset[9];
        mediump float kernel[9];
        offset[0] = vec2(-step_w, step_h);
        offset[1] = vec2(-step_w, 0.0);
        offset[2] = vec2(-step_w, -step_h);
        offset[3] = vec2(0.0, step_h);
        offset[4] = vec2(0.0, 0.0);
        offset[5] = vec2(0.0, -step_h);
        offset[6] = vec2(step_w, step_h);
        offset[7] = vec2(step_w, 0.0);
        offset[8] = vec2(step_w, -step_h);

        kernel[0] = kernel[2] = kernel[6] = kernel[8] = 1.0/k;
        kernel[1] = kernel[3] = kernel[5] = kernel[7] = 2.0/k;
        kernel[4] = -16.0/k;  

        if (enableTexture) {
            mediump vec4 sum = vec4(0.0);
            for (int i=0;i<9;i++) {
                mediump vec4 tmp = texture2D(texture, texCoord + offset[i]);
                sum += tmp * kernel[i];
            }

            gl_FragColor = (sum * b) + ((one-sum) * texture2D(texture, texCoord));
        } else {
            gl_FragColor = colorVarying;
        }
    }

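This is unoptimized and not finalized, but I need to bring performance up before continuing. I've tried replacing the texture2D() call in the loop with just a constant vec4, and it runs without a problem, despite everything else going on.

How can I optimize this? I know it's possible, because I've seen far more involved effects running in 3D without a problem. I can't see why this is causing any trouble at all.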

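【Comments】:

"I've tried replacing the texture2D() call in the loop with just a constant vec4, and it runs without a problem" What does that mean? Did it get faster? Did it not change the performance? What happened?
"I can't see why this is causing any trouble at all." You're doing ten texture accesses per shader invocation, and you can't see what might be causing the problem? Also, you access the center texel twice.
Without the texture lookups (not counting the final one), I get a solid 60 fps. As I said, it's not optimized, but there's no way to avoid those texture calls; the filter can't work otherwise. But I've seen plenty of games, mobile and otherwise, that use effects based on convolution filters, and they don't seem to have any issue. Unless there's some trick to avoid them?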

Tags: opengl-es filter opengl-es-2.0 glsl convolution

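【Solution 1】:

I've done this exact thing myself, and I see several things that could be optimized here.

First, I'd remove the enableTexture conditional and instead split your shader into two programs, one for the true state and one for the false state. Conditionals are very expensive in iOS fragment shaders, particularly ones that have texture reads within them.

Second, you have nine dependent texture reads here. These are texture reads where the texture coordinates are calculated within the fragment shader. Dependent texture reads are very expensive on the PowerVR GPUs in iOS devices, because they prevent the hardware from optimizing texture reads using caching and similar techniques. Because you are sampling at fixed offsets for the 8 surrounding pixels and the one central pixel, these calculations should be moved up into the vertex shader. This also means the calculations don't have to be performed for every pixel, just once per vertex; hardware interpolation handles the rest.

Third, the iOS shader compiler has not handled for() loops all that well to date, so I tend to avoid them where I can.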
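
As a minimal sketch of the first point, the untextured path from the question could live in its own tiny program, so that neither program contains a branch. This reuses the question's colorVarying name. The textured program would then simply drop the enableTexture uniform and the if/else, and the application would pick one program or the other with glUseProgram() before drawing.

```glsl
// Fragment shader for the "texture disabled" program:
// no conditional and no texture reads, just the interpolated vertex color.
varying mediump vec4 colorVarying;

void main()
{
    gl_FragColor = colorVarying;
}
```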

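As I mentioned, I've written convolution shaders like this in my open source iOS GPUImage framework. For a generic convolution filter, I use the following vertex shader: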

 attribute vec4 position;
 attribute vec4 inputTextureCoordinate;

 uniform highp float texelWidth; 
 uniform highp float texelHeight; 

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     gl_Position = position;

     vec2 widthStep = vec2(texelWidth, 0.0);
     vec2 heightStep = vec2(0.0, texelHeight);
     vec2 widthHeightStep = vec2(texelWidth, texelHeight);
     vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight);

     textureCoordinate = inputTextureCoordinate.xy;
     leftTextureCoordinate = inputTextureCoordinate.xy - widthStep;
     rightTextureCoordinate = inputTextureCoordinate.xy + widthStep;

     topTextureCoordinate = inputTextureCoordinate.xy - heightStep;
     topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep;
     topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep;

     bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep;
     bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep;
     bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep;
 }

and the following fragment shader:

 precision highp float;

 uniform sampler2D inputImageTexture;

 uniform mediump mat3 convolutionMatrix;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     mediump vec4 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate);
     mediump vec4 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate);
     mediump vec4 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate);
     mediump vec4 centerColor = texture2D(inputImageTexture, textureCoordinate);
     mediump vec4 leftColor = texture2D(inputImageTexture, leftTextureCoordinate);
     mediump vec4 rightColor = texture2D(inputImageTexture, rightTextureCoordinate);
     mediump vec4 topColor = texture2D(inputImageTexture, topTextureCoordinate);
     mediump vec4 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate);
     mediump vec4 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate);

     mediump vec4 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2];
     resultColor += leftColor * convolutionMatrix[1][0] + centerColor * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2];
     resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2];

     gl_FragColor = resultColor;
 }

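The texelWidth and texelHeight uniforms are the inverse of the width and height of the input image, and the convolutionMatrix uniform specifies the weights for the various samples in your convolution.

On an iPhone 4, this runs in 4-8 ms for a 640x480 frame of camera video, which is good enough for 60 FPS rendering at that image size. If you just need to do something like edge detection, you can simplify the above: convert the image to luminance in a pre-pass, then sample from only one color channel. That is even faster, at about 2 ms per frame on the same device.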
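
The luminance pre-pass mentioned above could look roughly like the following fragment shader. This is a sketch, not necessarily the exact shader used in the framework: the uniform and varying names follow the shaders above, and the weighting vector W is an assumption (Rec. 709-style luma coefficients); any reasonable luminance weighting would do.

```glsl
precision highp float;

varying vec2 textureCoordinate;

uniform sampler2D inputImageTexture;

// Assumed luma weights (Rec. 709 style); they sum to 1.0.
const highp vec3 W = vec3(0.2125, 0.7154, 0.0721);

void main()
{
    lowp vec4 color = texture2D(inputImageTexture, textureCoordinate);

    // Collapse RGB to a single brightness value.
    float luminance = dot(color.rgb, W);

    // Write the same value to every color channel so that a later pass
    // can read just one of them.
    gl_FragColor = vec4(vec3(luminance), color.a);
}
```

The convolution pass that follows can then read only the .r component of each sample and accumulate plain floats instead of vec4 values.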

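【Comments】:

Excellent example. tl;dr: avoid dependent texture reads. Also make an effort to test a separable convolution by rendering in two passes, to cut down on the number of fetches (though for an example like this one, with 9 fetches, it won't cut the count below half, so the two-pass method may be a bad idea in this case).
@StevenLu - On many of these GPUs there is a surprisingly sharp dropoff in performance once you go above 9 or so texture reads in a single pass. Splitting the work into two passes can have a nonlinear effect on performance relative to the number of samples in a single pass. I've tested this, and running it in a single pass is much slower than separating the kernel, even for this small a number of samples.
Is there a way to fetch a region of the texture at once, rather than a single pixel?
@AlexGonçalves - Within a fragment shader? No, texture2D() samples only one pixel at a time.
@CrearoRotar - On modern devices, the performance difference between 3x3 and 5x5 convolutions may not be that large. You won't be able to use varyings the way I do above, because you'll exceed the maximum number of varyings supported on most iOS hardware. For an unsharp mask, I'd probably recommend a separable Gaussian blur followed by a custom shader to blend the pixels, as I do here. That reduces the number of samples needed over a large area and speeds up the process.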

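【Solution 2】:

The only way I know of to reduce the time taken by this shader is to reduce the number of texture fetches. Since your shader samples the texture at equally spaced points around the center pixel and combines them linearly, you can reduce the number of fetches by making use of the GL_LINEAR mode for texture sampling.

Basically, instead of sampling at every texel, sample in between a pair of texels to get a linearly weighted sum directly.

Let us call the samples at offsets (-stepw,-steph) and (-stepw,0) x0 and x1 respectively. Then your sum is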

sum = x0*k0 + x1*k1

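Now, if you instead sample in between these two texels, at a distance of k1/(k0+k1) from x0 and therefore k0/(k0+k1) from x1 (linear filtering weights each texel by one minus its distance from the sample point, so the sample must sit closer to the texel with the larger weight), then the GPU will perform the linear weighting during the fetch and give you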

y = x1*k1/(k0+k1) + x0*k0/(k1+k0)

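So the sum can be calculated as

sum = y*(k0 + k1), from just one fetch!

If you repeat this for the other adjacent pixels, you end up doing 4 texture fetches for the 8 adjacent offsets (one fetch per pair) plus one extra fetch for the center pixel, 5 in total instead of 9. Note that this only works when the two texels in a pair are directly adjacent and k0 and k1 have the same sign.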
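
A rough sketch of how this could look for the question's kernel (corner weight 1/k, edge weight 2/k, center weight -16/k) follows. It rests on several assumptions: the texture uses GL_LINEAR filtering, the image is drawn 1:1 so that the interpolated coordinate lands on texel centers, and the sampling step is exactly one texel (the question's step of 3.0/128.0 may span several texels, in which case the hardware blend would not combine the intended pair). The names pairCoordinate0..3, centerCoordinate, pairWeight and centerWeight are made up for this illustration. Each corner is paired with a neighboring edge texel, and the sample point sits one third of a texel away from the edge texel, so the hardware returns 2/3 of the edge texel plus 1/3 of the corner texel. The coordinates are computed in the vertex shader so the fragment shader contains no dependent texture reads.

```glsl
// Vertex shader: precompute the five sample positions.
attribute vec4 position;
attribute vec4 inputTextureCoordinate;

uniform highp float texelWidth;   // 1.0 / texture width
uniform highp float texelHeight;  // 1.0 / texture height

varying vec2 centerCoordinate;
varying vec2 pairCoordinate0;
varying vec2 pairCoordinate1;
varying vec2 pairCoordinate2;
varying vec2 pairCoordinate3;

void main()
{
    gl_Position = position;

    vec2 uv = inputTextureCoordinate.xy;
    vec2 texel = vec2(texelWidth, texelHeight);

    centerCoordinate = uv;

    // Offsets are in texels. Each point lies between a corner texel and
    // an edge texel, one third of a texel away from the edge texel.
    pairCoordinate0 = uv + vec2(-1.0, -1.0 / 3.0) * texel; // (-1,-1) and (-1, 0)
    pairCoordinate1 = uv + vec2(-1.0 / 3.0, 1.0) * texel;  // (-1, 1) and ( 0, 1)
    pairCoordinate2 = uv + vec2(1.0, 1.0 / 3.0) * texel;   // ( 1, 1) and ( 1, 0)
    pairCoordinate3 = uv + vec2(1.0 / 3.0, -1.0) * texel;  // ( 1,-1) and ( 0,-1)
}
```

```glsl
// Fragment shader: five fetches instead of nine.
precision highp float;

uniform sampler2D inputImageTexture; // assumed to use GL_LINEAR filtering
uniform mediump float pairWeight;    // k0 + k1, i.e. 3.0 / k for this kernel
uniform mediump float centerWeight;  // -16.0 / k for this kernel

varying vec2 centerCoordinate;
varying vec2 pairCoordinate0;
varying vec2 pairCoordinate1;
varying vec2 pairCoordinate2;
varying vec2 pairCoordinate3;

void main()
{
    // Each fetch already holds (1/3) * corner + (2/3) * edge.
    mediump vec4 pairs = texture2D(inputImageTexture, pairCoordinate0)
                       + texture2D(inputImageTexture, pairCoordinate1)
                       + texture2D(inputImageTexture, pairCoordinate2)
                       + texture2D(inputImageTexture, pairCoordinate3);

    mediump vec4 center = texture2D(inputImageTexture, centerCoordinate);

    gl_FragColor = pairs * pairWeight + center * centerWeight;
}
```

Texture filtering hardware typically blends with limited fractional precision, so the result may differ slightly from the nine-fetch version; whether that is acceptable depends on the effect.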

The linked article explains this in more detail.
