使用 SIMD 优化图像大小调整（方法 Nearest）答案

【问题标题】：Optimization of image resizing (method Nearest) with using SIMD使用 SIMD 优化图像大小调整（方法 Nearest）
【发布时间】：2022-01-11 14:28:27
【问题描述】：

我知道调整图像大小的“最近”方法是最快的方法。尽管如此，我还是在寻找加速它的方法。明显的步骤是预先计算的索引：

void CalcIndex(int sizeS, int sizeD, int colors, int* idx)
{
    float scale = (float)sizeS / sizeD;
    for (size_t i = 0; i < sizeD; ++i)
    {
        int index = (int)::floor((i + 0.5f) * scale)
        idx[i] = Min(Max(index, 0), sizeS - 1) * colors;
    }
}

template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}

template<int colors> void Resize(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices (see CalcIndex).
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<N>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}

是否存在下一个优化步骤？例如使用 SIMD 或其他一些优化技术。

附：特别是我对 RGB 的优化很感兴趣（Colors = 3）。如果我使用当前代码，我发现 ARGB 图像 (Colors = 4) 的处理速度比 RGB 快 50%，尽管它比 RGB 大 30%。

【问题讨论】：

如果你真的关心性能，我会考虑做这个硬件加速，所以使用例如DirectX 或 Vulkan。
@Savage 'Nearest' 方法很快。如果我将在 CPU 上执行所有计算，我担心将图像内存复制到 GPU 并返回会更慢。
你可以看看 opencv 是怎么做的：github.com/opencv/opencv/blob/master/modules/imgproc/src/… github.com/opencv/opencv/blob/master/modules/imgproc/src/… 对于某些格式，opencv 也调用 IPP 函数或没有显式 SIMD 代码的实现。如果您真的想尝试自己编写，至少 opencv/IPP 实现应该为您提供一个很好的基准来进行基准测试。
你在哪里计算idxY和idxX？重要的是要知道这些来自哪里以及它们是如何排序的，以便为您提供最好的帮助。
@CrisLuengo 我添加了索引计算功能。

标签： c++ image-processing simd simd-library synet

【解决方案1】：

我认为使用 _mm256_i32gather_epi32 (AVX2) 可以在 32 位像素的情况下为调整大小提供一些性能提升：

inline void Gather32bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i val = _mm256_i32gather_epi32((int*)src, _idx, 1);
    _mm256_storeu_si256((__m256i*)dst, val);
}

template<> void Resize<4>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices.
    size_t dstW8 = dstW & (8 - 1);
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*4)
            Gather32bit(srcY, idxX + dx,dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<N>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}

附：经过一些修改，此方法可以应用于 RGB24：

const __m256i K8_SHUFFLE = _mm256_setr_epi8(
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);


inline void Gather24bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i bgrx = _mm256_i32gather_epi32((int*)src, _idx, 1);
    __m256i bgr = _mm256_permutevar8x32_epi32(
        _mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
    _mm256_storeu_si256((__m256i*)dst, bgr);
}

template<> void Resize<3>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices.
    size_t dstW8 = dstW & (8 - 1);
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*3)
            Gather24bit(srcY, idxX + dx,dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;
    }
}

注意如果srcW < dstW 那么@Aki-Suihkonen 的方法更快。

【讨论】：

我希望_mm256_i32gather_epi32 只有在调整到如此小的尺寸以致对于 RGBA 仅水平保留每 4 个源像素中的 1 个或更少时才会有所帮助。（因此，正常的非聚集矢量负载仍然只能获得 1 个有用像素）。或者也许通过快速聚集（如 Skylake），它可以节省足够的洗牌工作，值得。 Zen2/3 的收集速度较慢，而 IIRC 的 shuffle 吞吐量则更好。
@PeterCordes 我在 Skylake 上进行了测量。在srcW > dstW 的情况下，对 RGB 使用 _mm256_i32gather_epi32 可以获得 2.0-2.5 的性能增益（与标量代码相比）。
好吧，很公平，但如果连续 32 字节源数据块中平均少于 2 个像素，则可能只有最佳。我想收集负载可以让你做一个连续的存储，并且只有负载吞吐量的瓶颈（2/时钟）而不是存储（SKL 上的 1/时钟），以及减少循环开销与标量，所以我并不感到震惊加速与标量。但是，例如 2:1 的缩小比例，您从每隔一行中获取每个其他像素应该比使用该比例的专门实现更快。甚至可能使用 Soonts 的非专业 shuffle 生成器。

【解决方案2】：

（基于 SIMD 的）调整大小算法中的速度问题来自索引输入和输出元素的不匹配。当例如调整大小因子为 6/5，需要消耗 6 个像素并写入 5。OTOH SIMD 寄存器宽度为 16 个字节，映射到 16 个灰度元素、4 个 RGBA 元素或 5.33 个 RGB 元素。

我的经验是，当尝试一次写入 2-4 个 SIMD 寄存器的数据，读取所需的线性字节数时，性能足够好（可能不是最佳的，但经常击败 opencv 和其他免费提供的实现）从输入 + 一些，并在 x86 SSSE3 中使用 pshufb 或在 Neon 中使用 vtbl 从寄存器收集负载 - 从不从内存中收集负载。当然需要一种快速的机制来计算 LUT 索引内联，或预先计算索引，这些索引在不同的输出行之间共享。

应该根据（水平）分辨率的输入/输出比准备好几个内核。

RGBrgbRGBrgbRGBr|gbRGBrgb ....  <- input
                         ^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, from 

0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index

注意，使用 LUT 方法可以处理所有通道计数

// inner kernel for downsampling between 1x and almost 2x*
// - we need to read max 32 elements and write 16
void process_row_ds(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto b = load16_bytes(input + 16);
       auto c = load16_bytes(indices);
       a = lut32(a,b,c);      // get 16 bytes out of 32
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);  // multiples of 16...
}

// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto c = load16_bytes(indices);
       a = lut16(a, c);      // get 16 bytes out of 16
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);
}

我还鼓励使用一些基本过滤进行下采样，例如高斯二项式内核(1 1, 1 2 1, 1 3 3 1, 1 4 6 4 1, ...) 除了（至少）双线性插值之外的分层下采样。当然，应用程序可能会容忍混叠伪影——AFAIK 的成本通常不会那么大，尤其是考虑到否则算法将受到内存限制。

【讨论】：

【解决方案3】：

使用 SIMD 是可能的，我很确定它会有所帮助，但不幸的是它相对困难。下面是一个简化的例子，只支持图片放大，不支持缩小。

不过，我希望它可以作为一个有用的起点。

MSVC和GCC都将LineResize::apply方法中的热循环编译成11条指令。我认为 16 字节的 11 条指令应该比你的版本快。

#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>

// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
    // Each mask produces up to 16 output bytes.
    // For enlargement exactly 16, for shrinking up to 16, possibly even 0.
    std::vector<__m128i> masks;

    // Length is the same as masks.
    // For enlargement, the values contain source pointer offsets in bytes.
    // For shrinking, the values contain destination pointer offsets in bytes.
    std::vector<uint8_t> offsets;

    // True if this class will enlarge images, false if it will shrink the width of the images.
    bool enlargement;

    void resizeFields( size_t vectors )
    {
        masks.resize( vectors, _mm_set1_epi32( -1 ) );
        offsets.resize( vectors, 0 );
    }

public:

    // Compile the shuffle table. The arguments are line widths in pixels.
    LineResize( size_t source, size_t dest );

    // Apply the algorithm to a single line of the image.
    void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};

LineResize::LineResize( size_t source, size_t dest )
{
    const size_t sourceBytes = source * 3;
    const size_t destBytes = dest * 3;
    assert( sourceBytes >= 16 );
    assert( destBytes >= 16 );

    // Possible to do much faster without any integer divides.
    // Optimizing this sample for simplicity.
    if( sourceBytes < destBytes )
    {
        // Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
        enlargement = true;
        resizeFields( ( destBytes + 15 ) / 16 );

        int8_t* pMasks = (int8_t*)masks.data();
        uint8_t* const pOffsets = offsets.data();

        int sourceOffset = 0;
        const size_t countVectors = masks.size();
        for( size_t i = 0; i < countVectors; i++ )
        {
            const int destSlice = (int)i * 16;
            std::array<int, 16> lanes;
            int lane;
            for( lane = 0; lane < 16; lane++ )
            {
                const int destByte = destSlice + lane;  // output byte index
                const int destPixel = destByte / 3; // output pixel index
                const int channel = destByte % 3;   // output byte within pixel
                const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
                const int sourceByte = sourcePixel * 3 + channel;   // input byte

                if( destByte < (int)destBytes )
                    lanes[ lane ] = sourceByte;
                else
                {
                    // Destination offset out of range, i.e. the last SIMD vector
                    break;
                }
            }

            // Produce the offset
            if( i == 0 )
                assert( lanes[ 0 ] == 0 );
            else
            {
                const int off = lanes[ 0 ] - sourceOffset;
                assert( off >= 0 && off <= 16 );
                pOffsets[ i - 1 ] = (uint8_t)off;
                sourceOffset = lanes[ 0 ];
            }

            // Produce the masks
            for( int j = 0; j < lane; j++ )
                pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
            // The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
            // no need to handle remainder for the last vector.
            pMasks += 16;
        }
    }
    else
    {
        // Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
        enlargement = false;
        resizeFields( ( sourceBytes + 15 ) / 16 );

        // Not implemented, but the same idea works fine for this too.
        // The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
        assert( false );
    }
}

void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
    const __m128i* pm = masks.data();
    const __m128i* const pmEnd = pm + masks.size();
    const uint8_t* po = offsets.data();
    __m128i mask, source;

    if( enlargement )
    {
        // One iteration of the loop produces 16 output bytes
        // In MSVC results in 11 instructions for 16 output bytes.
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += *po;
            po++;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += 16;
        }
    }
    else
    {
        // One iteration of the loop consumes 16 input bytes
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += 16;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += *po;
            po++;
        }
    }
}

// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>&vec )
{
    assert( !vec.empty() );
    assert( 0 == ( vec.size() % 3 ) );

    const uint8_t* rsi = vec.data();
    const uint8_t* const rsiEnd = rsi + vec.size();
    while( rsi < rsiEnd )
    {
        const uint32_t r = rsi[ 0 ];
        const uint32_t g = rsi[ 1 ];
        const uint32_t b = rsi[ 2 ];
        rsi += 3;
        const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
        printf( "%06X ", res );
    }
    printf( "\n" );
}

// A triviual test to resize 24 pixels -> 32 pixels
int main()
{
    constexpr int sourceLength = 24;
    constexpr int destLength = 32;

    // Initialize sample input with 24 RGB pixels
    std::vector<uint8_t> input( sourceLength * 3 );
    for( size_t i = 0; i < input.size(); i++ )
        input[ i ] = (uint8_t)i;

    printf( "Input: " );
    printPixels( input );

    // That special handling of the last pixels of last line is missing from this example.
    static_assert( 0 == destLength % 16 );
    LineResize resizer( sourceLength, destLength );

    std::vector<uint8_t> result( destLength * 3 );
    resizer.apply( result.data(), input.data() );

    printf( "Output: " );
    printPixels( result );
    return 0;
}

代码忽略对齐问题。对于生产，您需要另一种方法来处理图像的最后一行，它不会运行到最后，而是使用标量代码处理最后几个像素。

代码在热循环中包含更多的内存引用。但是，该类中的两个向量并不太长，对于 4k 图像，大小约为 12kb，应该适合 L1D 缓存并保留在那里。

如果您有 AVX2，可能会进一步改进。对于放大图像，使用_mm256_inserti128_si256，vinserti128 指令可以将 16 个字节从内存加载到向量的高半部分。同样，对于缩小图像，使用_mm256_extracti128_si256，该指令具有使用内存目标的选项。

【讨论】：