Fast threshold and bit packing algorithm (possible improvements?): answers

【Question title】: Fast threshold and bit packing algorithm (possible improvements?)
【Posted】: 2011-04-11 22:39:00
【Question description】:

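I'm working on an algorithm that globally thresholds an 8-bit grayscale image into a 1-bit monochrome image (bit-packed, so that 1 byte contains 8 pixels). Each pixel in the grayscale image has a luminance value of 0 to 255.

My environment is Win32 in Microsoft Visual Studio C++.

Out of curiosity, I'm interested in optimizing this algorithm as much as possible. The 1-bit image will be turned into a TIFF. Currently I set FillOrder to MSB2LSB (most significant bit to least significant bit) only because the TIFF specification suggests it (it doesn't necessarily need to be MSB2LSB).

Just some background for those who don't know:

MSB2LSB orders the pixels within a byte from left to right, the same way pixels are laid out in the image as the X coordinate increases. If you traverse the grayscale image from left to right along the X axis, this obviously requires you to think "backwards" when packing bits into the current byte. With that said, let me show you what I currently have (this is in C; I haven't attempted ASM or compiler intrinsics yet simply because I have very little experience with them, but that would be a possibility).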

Because the monochrome image has 8 pixels per byte, the width of the monochrome image in bytes will be

(grayscaleWidth+7)/8;
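As a rough illustration of the stride formula and the MSB2LSB ordering described above, here is a minimal sketch. The helper names `packed_stride` and `msb_first_mask` are made up for this example and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold one row of `width` 1-bit pixels (rounds up). */
static size_t packed_stride(size_t width)
{
    assert(width <= SIZE_MAX - 7);   /* guard against overflow in width + 7 */
    return (width + 7) / 8;
}

/* Mask selecting pixel x inside its byte under MSB2LSB fill order:
   x % 8 == 0 maps to bit 7 (0x80), x % 8 == 7 maps to bit 0 (0x01). */
static unsigned char msb_first_mask(size_t x)
{
    return (unsigned char)(0x80u >> (x & 7u));
}
```

For the 6000-pixel maximum width assumed in the question, `packed_stride(6000)` evaluates to 750 bytes per row, and the first pixel of every byte maps to `0x80`.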

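FYI, I assume my largest image to be 6000 pixels wide.

The first thing I do (before any image is processed) is

1) compute a lookup table of the amount I need to shift into a specific byte, given an X coordinate from the grayscale image: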

int _shift_lut[6000];

for( int x = 0 ; x < 6000; x++)
{ 
    _shift_lut[x] = 7-(x%8);
}

With this lookup table I can pack a monochrome bit value into the current byte I'm working on, like so:

monochrome_pixel |= 1 << _shift_lut[ grayX ];

This ends up being a huge speed increase over doing

monochrome_pixel |= 1 << (7-(x%8));

The second lookup table I compute tells me the X index into my monochrome image given an X pixel in the grayscale image. This very simple LUT is computed like so:

int xOffsetLut[6000];
int element_size=8; //8 bits
for( int x = 0; x < 6000; x++)
{
    xOffsetLut[x]=x/element_size;
}

This LUT allows me to do things like

monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels
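For comparison, both tables can be reproduced with plain arithmetic on an unsigned index. The sketch below (all helper names are hypothetical) rebuilds the two LUTs the same way as above and checks them against `7 - (x & 7)` and `x >> 3`. It only shows that the values agree; it says nothing about which variant is faster on a given compiler or CPU.

```c
#include <assert.h>   /* used by the checks mentioned after the block */

enum { MAX_WIDTH = 6000 };   /* same upper bound the question assumes */

/* Shift amount for pixel x under MSB2LSB; equals 7 - (x % 8) for unsigned x. */
static unsigned shift_for(unsigned x) { return 7u - (x & 7u); }

/* Index of the packed byte holding pixel x; equals x / 8 for unsigned x. */
static unsigned byte_index_for(unsigned x) { return x >> 3; }

/* Returns 1 if the arithmetic forms reproduce both tables for every x. */
static int tables_match_arithmetic(void)
{
    static int shift_lut[MAX_WIDTH];
    static int x_offset_lut[MAX_WIDTH];

    for (int x = 0; x < MAX_WIDTH; x++) {
        shift_lut[x] = 7 - (x % 8);
        x_offset_lut[x] = x / 8;
    }
    for (unsigned x = 0; x < (unsigned)MAX_WIDTH; x++) {
        if (shift_lut[x] != (int)shift_for(x)) return 0;
        if (x_offset_lut[x] != (int)byte_index_for(x)) return 0;
    }
    return 1;
}
```

A quick check such as `assert(tables_match_arithmetic());` should pass, since the index is never negative here.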

My grayscale image is a simple unsigned char *, and so is my monochrome image.

This is how I initialize the monochrome image:

int bitPackedScanlineStride = (grayscaleWidth+7)/8;
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight;
unsigned char * bitpack_image = new unsigned char[bitpackedLength];
memset(bitpack_image,0,bitpackedLength);

Then I call my binarize function like so:

binarize(
    gray_image.DataPtr(),
    bitpack_image,
    globalFormThreshold,
    grayscaleWidth,
    grayscaleHeight,
    bitPackedScanlineStride,
    bitpackedLength,
    _shift_lut,  
    xOffsetLut);

Here is my binarize function (as you can see, I did some loop unrolling, which may or may not help).

void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int  bitPackedScanlineStride, int bitpackedLength,  int shiftLUT[], int xOffsetLUT[] )
{
    int yoff;
    int byoff;
    unsigned char bitpackPel=0;
    unsigned char pel1=0;
    unsigned char  pel2=0;
    unsigned char  pel3=0;
    unsigned char  pel4=0;
    unsigned char  pel5=0;
    unsigned char  pel6=0;
    unsigned char  pel7=0;
    unsigned char  pel8=0;
    int checkX=grayscaleWidth;
    int checkY=grayscaleHeight;

    for ( int by = 0 ; by < checkY; by++)
    {
    yoff=by*grayscaleWidth;
    byoff=by*bitPackedScanlineStride;

    for( int bx = 0; bx < checkX; bx+=32)
    {
        bitpackPel = 0;

        //pixel 1 in bitpack image
        pel1=grayImage[yoff+bx];
        pel2=grayImage[yoff+bx+1];
        pel3=grayImage[yoff+bx+2];
        pel4=grayImage[yoff+bx+3];
        pel5=grayImage[yoff+bx+4];
        pel6=grayImage[yoff+bx+5];
        pel7=grayImage[yoff+bx+6];
        pel8=grayImage[yoff+bx+7];

        bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );

        bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;

        //pixel 2 in bitpack image
        pel1=grayImage[yoff+bx+8];
        pel2=grayImage[yoff+bx+9];
        pel3=grayImage[yoff+bx+10];
        pel4=grayImage[yoff+bx+11];
        pel5=grayImage[yoff+bx+12];
        pel6=grayImage[yoff+bx+13];
        pel7=grayImage[yoff+bx+14];
        pel8=grayImage[yoff+bx+15];

        bitpackPel = 0;

        bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );

        bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;

        //pixel 3 in bitpack image
        pel1=grayImage[yoff+bx+16];
        pel2=grayImage[yoff+bx+17];
        pel3=grayImage[yoff+bx+18];
        pel4=grayImage[yoff+bx+19];
        pel5=grayImage[yoff+bx+20];
        pel6=grayImage[yoff+bx+21];
        pel7=grayImage[yoff+bx+22];
        pel8=grayImage[yoff+bx+23];

        bitpackPel = 0;

        bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );

        bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;

        //pixel 4 in bitpack image
        pel1=grayImage[yoff+bx+24];
        pel2=grayImage[yoff+bx+25];
        pel3=grayImage[yoff+bx+26];
        pel4=grayImage[yoff+bx+27];
        pel5=grayImage[yoff+bx+28];
        pel6=grayImage[yoff+bx+29];
        pel7=grayImage[yoff+bx+30];
        pel8=grayImage[yoff+bx+31];

        bitpackPel = 0;

        bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );

        bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
    }
}
}

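I know this algorithm could potentially miss a few trailing pixels in each row, but don't worry about that.

As you can see, for each monochrome byte I process 8 grayscale pixels.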

Where you see pelN<=threshold, the comparison resolves to 0 or 1, which is much faster than an if/else.

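For every increment in X, I pack a bit into the next lower-order bit than the previous X (the first pixel of each group goes into bit 7, the most significant bit).

So for the first set of 8 pixels in the grayscale image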

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

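this is what the bits in the byte look like (obviously each numbered bit is just the result of thresholding the correspondingly numbered pixel, but you get the idea)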

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
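The layout above can be sketched as a plain reference loop that also covers the trailing pixels of each row. This is only a baseline for checking faster variants against, not the unrolled function from the question; the name `binarize_reference` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Threshold an 8-bit grayscale image into a 1-bit MSB-first image.
   A bit is set when pixel <= threshold, as in the question's code.
   dst must hold height * ((width + 7) / 8) bytes. */
static void binarize_reference(const unsigned char *src, unsigned char *dst,
                               unsigned char threshold,
                               size_t width, size_t height)
{
    const size_t stride = (width + 7) / 8;

    assert(src != NULL && dst != NULL);
    memset(dst, 0, stride * height);

    for (size_t y = 0; y < height; y++) {
        const unsigned char *row = src + y * width;
        unsigned char *out = dst + y * stride;

        for (size_t x = 0; x < width; x++) {
            /* The comparison yields 0 or 1; shift it to bit 7 - (x % 8). */
            out[x >> 3] |= (unsigned char)((row[x] <= threshold) << (7 - (x & 7)));
        }
    }
}
```

Unused bits in the last byte of a row stay 0 because the output is cleared first.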

PHEW, that should be it. Feel free to have fun with some nifty bit-twiddling tricks that squeeze more juice out of this algorithm.

With compiler optimizations turned on, this function takes on average 16 milliseconds on an approximately 5000 x 2200 pixel image on a Core 2 Duo machine.

EDIT:

R..'s suggestion was to remove the shift LUT and just use constants, which is actually perfectly logical. I have modified the OR'ing of each pixel like so:

void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int  bitPackedScanlineStride, int bitpackedLength,  int shiftLUT[], int xOffsetLUT[] )
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char  pel2=0;
unsigned char  pel3=0;
unsigned char  pel4=0;
unsigned char  pel5=0;
unsigned char  pel6=0;
unsigned char  pel7=0;
unsigned char  pel8=0;
int checkX=grayscaleWidth-32;
int checkY=grayscaleHeight;

for ( int by = 0 ; by < checkY; by++)
{
    yoff=by*grayscaleWidth;
    byoff=by*bitPackedScanlineStride;

    for( int bx = 0; bx < checkX; bx+=32)
    {
        bitpackPel = 0;

        //pixel 1 in bitpack image
        pel1=grayImage[yoff+bx];
        pel2=grayImage[yoff+bx+1];
        pel3=grayImage[yoff+bx+2];
        pel4=grayImage[yoff+bx+3];
        pel5=grayImage[yoff+bx+4];
        pel6=grayImage[yoff+bx+5];
        pel7=grayImage[yoff+bx+6];
        pel8=grayImage[yoff+bx+7];

        /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );*/
        bitpackPel |= ( (pel1<=threshold) << 7);
        bitpackPel |= ( (pel2<=threshold) << 6 );
        bitpackPel |= ( (pel3<=threshold) << 5 );
        bitpackPel |= ( (pel4<=threshold) << 4 );
        bitpackPel |= ( (pel5<=threshold) << 3 );
        bitpackPel |= ( (pel6<=threshold) << 2 );
        bitpackPel |= ( (pel7<=threshold) << 1 );
        bitpackPel |= ( (pel8<=threshold)  );

        bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;

        //pixel 2 in bitpack image
        pel1=grayImage[yoff+bx+8];
        pel2=grayImage[yoff+bx+9];
        pel3=grayImage[yoff+bx+10];
        pel4=grayImage[yoff+bx+11];
        pel5=grayImage[yoff+bx+12];
        pel6=grayImage[yoff+bx+13];
        pel7=grayImage[yoff+bx+14];
        pel8=grayImage[yoff+bx+15];

        bitpackPel = 0;

        /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );*/
        bitpackPel |= ( (pel1<=threshold) << 7);
        bitpackPel |= ( (pel2<=threshold) << 6 );
        bitpackPel |= ( (pel3<=threshold) << 5 );
        bitpackPel |= ( (pel4<=threshold) << 4 );
        bitpackPel |= ( (pel5<=threshold) << 3 );
        bitpackPel |= ( (pel6<=threshold) << 2 );
        bitpackPel |= ( (pel7<=threshold) << 1 );
        bitpackPel |= ( (pel8<=threshold)  );


        bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;

        //pixel 3 in bitpack image
        pel1=grayImage[yoff+bx+16];
        pel2=grayImage[yoff+bx+17];
        pel3=grayImage[yoff+bx+18];
        pel4=grayImage[yoff+bx+19];
        pel5=grayImage[yoff+bx+20];
        pel6=grayImage[yoff+bx+21];
        pel7=grayImage[yoff+bx+22];
        pel8=grayImage[yoff+bx+23];

        bitpackPel = 0;

        /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );*/
        bitpackPel |= ( (pel1<=threshold) << 7);
        bitpackPel |= ( (pel2<=threshold) << 6 );
        bitpackPel |= ( (pel3<=threshold) << 5 );
        bitpackPel |= ( (pel4<=threshold) << 4 );
        bitpackPel |= ( (pel5<=threshold) << 3 );
        bitpackPel |= ( (pel6<=threshold) << 2 );
        bitpackPel |= ( (pel7<=threshold) << 1 );
        bitpackPel |= ( (pel8<=threshold)  );


        bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;

        //pixel 4 in bitpack image
        pel1=grayImage[yoff+bx+24];
        pel2=grayImage[yoff+bx+25];
        pel3=grayImage[yoff+bx+26];
        pel4=grayImage[yoff+bx+27];
        pel5=grayImage[yoff+bx+28];
        pel6=grayImage[yoff+bx+29];
        pel7=grayImage[yoff+bx+30];
        pel8=grayImage[yoff+bx+31];

        bitpackPel = 0;

        /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24]  );
        bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25]  );
        bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
        bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
        bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
        bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
        bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
        bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );*/
        bitpackPel |= ( (pel1<=threshold) << 7);
        bitpackPel |= ( (pel2<=threshold) << 6 );
        bitpackPel |= ( (pel3<=threshold) << 5 );
        bitpackPel |= ( (pel4<=threshold) << 4 );
        bitpackPel |= ( (pel5<=threshold) << 3 );
        bitpackPel |= ( (pel6<=threshold) << 2 );
        bitpackPel |= ( (pel7<=threshold) << 1 );
        bitpackPel |= ( (pel8<=threshold)  );


        bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
    }
}
}

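I'm now testing on an Intel Xeon 5670, using (GCC) 4.1.2. Under these specs the hardcoded bit shifts are roughly 4 ms slower than my original LUT algorithm. On the Xeon with GCC, the LUT algorithm takes on average 8.61 ms and the hardcoded bit shifts take on average 12.285 ms.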

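【Comments】:

Your lookup tables are useless. Simply computing the shift (if you do it correctly, rather than using the % operator on signed integers, which is quite slow) is much faster than a lookup table. Or, better yet, you could unroll the loop and hard-code the 8 shifts. Constant shifts are usually much faster than variable shifts, so that should help a lot.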
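I've modified the algorithm to simply use constant bit shifts... it actually ends up being 4 ms slower than the LUT. I'm now on an Intel Xeon with GCC 4.1.2. The algorithm with the LUT takes on average 8.61 ms, and the one without the LUT takes on average 12.285 ms.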
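+1 to R..; the second LUT is equally useless, since x/8 becomes x>>3, which is faster than *(lut+x) because you don't need to dereference a pointer. If you really think insane portability is worth it (and isn't already precluded by other constructs you're using), you could use x/CHAR_BIT.
@alssandro, that doesn't sound right. Can you post the code you used to get 8.61 and 12.285?
x/8 will not become x>>3 unless x is unsigned or the compiler can determine that x will never be negative.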

Tags: c image-processing optimization

【Solution 1】:

Try something like:

unsigned i, w8=w>>3, x;
for (i=0; i<w8; i++) {
    x = thres-src[0]>>1&0x80;
    x |= thres-src[1]>>2&0x40;
    x |= thres-src[2]>>3&0x20;
    x |= thres-src[3]>>4&0x10;
    x |= thres-src[4]>>5&0x08;
    x |= thres-src[5]>>6&0x04;
    x |= thres-src[6]>>7&0x02;
    x |= thres-src[7]>>8&0x01;
    out[i] = x;
    src += 8;
}

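You can work out the extra code for the remainder at the end of a row whose width isn't a multiple of 8, or you can just pad/align the source to make sure it is a multiple of 8.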
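One way the remainder handling could look is sketched below. It keeps the borrow trick from the snippet above but does the subtraction in unsigned arithmetic, so the wrap-around is well defined. The function name `pack_row_borrow` and its parameter names are assumptions for this example. Note that a bit ends up set where the pixel is greater than the threshold, which is the inverse of the `<=` test used in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Pack one row of w pixels, MSB first. A bit is set when src[x] > thres,
   i.e. when the unsigned subtraction thres - src[x] has to borrow. */
static void pack_row_borrow(const unsigned char *src, unsigned char *out,
                            unsigned thres, size_t w)
{
    assert(thres <= 255u);

    for (size_t i = 0; i < w; i += 8) {
        /* Number of valid pixels in this byte: 8, or fewer at the row end. */
        size_t n = (w - i < 8) ? (w - i) : 8;
        unsigned x = 0;

        for (size_t k = 0; k < n; k++) {
            /* Bit 8 of the difference is the borrow; move it to bit 7 - k. */
            x |= ((thres - (unsigned)src[i + k]) >> (k + 1)) & (0x80u >> k);
        }
        out[i >> 3] = (unsigned char)x;
    }
}
```

The inner loop is left rolled for brevity; a compiler may or may not unroll it the way the hand-written version does.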

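【Discussion】:

Are you sure those shifts shouldn't run from 0 to 7 rather than 1 to 8 (assuming threshold and src are both 8-bit values)?
Yes, I chose the correct shift values. I'm shifting bit 8 down, not bit 7, because I want the borrow bit of the integer result. Bit 7 can be either 0 or 1 regardless of whether thres-src[k] wrapped around modulo UINT_MAX+1.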
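The point about bit 8 versus bit 7 can be checked with a couple of concrete values. The two helpers below are hypothetical and assume both inputs are in the range 0 to 255 and that the subtraction is done on `unsigned` values.

```c
#include <assert.h>   /* used by the checks mentioned after the block */

/* Bit 8 of the unsigned difference: 1 exactly when src > thres
   for 8-bit inputs (the subtraction had to borrow). */
static unsigned borrow_bit(unsigned thres, unsigned src)
{
    return ((thres - src) >> 8) & 1u;
}

/* Bit 7 of the same difference, shown only for comparison. */
static unsigned bit7(unsigned thres, unsigned src)
{
    return ((thres - src) >> 7) & 1u;
}
```

With thres = 200 and src = 50 the difference is 150, so bit 7 is 1 although no borrow happened. With thres = 50 and src = 250 the subtraction wraps, the low byte of the result is 56, and bit 7 is 0 although a borrow did happen. In both cases bit 8 gives the intended answer.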

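【Solution 2】:

You can do this quite easily with SSE, processing 16 pixels at a time, e.g.

load a vector (16 x 8-bit unsigned)
add (255 - threshold) to each element
use PMOVMSKB to extract the sign bits into a 16-bit word
store the 16-bit word

Example code using SSE intrinsics (warning: untested!):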

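#include <emmintrin.h>  // SSE2 intrinsics
#include <limits.h>     // CHAR_BIT
#include <stdint.h>     // uint8_t, uint16_t

void threshold_and_pack(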
    const uint8_t * in_image,       // input image, 16 byte aligned, height rows x width cols, width = multiple of 16
    uint8_t * out_image,            // output image, 2 byte aligned, height rows x width/8 cols, width = multiple of 2
    const uint8_t threshold,        // threshold
    const int width,
    const int height)
{
    const __m128i vThreshold = _mm_set1_epi8(255 - threshold);
    int i, j;

    for (i = 0; i < height; ++i)
    {
        const __m128i * p_in = (__m128i *)&in_image[i * width];
        uint16_t * p_out = (uint16_t *)&out_image[i * width / CHAR_BIT];

        for (j = 0; j < width; j += 16)
        {
            __m128i v = _mm_load_si128(p_in);
            uint16_t b;

            v = _mm_add_epi8(v, vThreshold);
            b = _mm_movemask_epi8(v);   // use PMOVMSKB to pack sign bits into 16 bit word

            *p_out = b;

            p_in++;
            p_out++;
        }
    }
}
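A wrapping 8-bit add does not seem to give an exact unsigned comparison for every threshold: with a threshold of 100, for example, pixel values 0 and 255 both end up with the top bit set. The sketch below swaps in a different technique, an unsigned max followed by a compare for equality, which yields `pixel <= threshold` for each byte before PMOVMSKB is applied. The function name `threshold_pack_row` is hypothetical, unaligned loads are used so no alignment is assumed, and the SIMD path is only compiled when `__SSE2__` is defined. Bits come out LSB first, since that is the order PMOVMSKB produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Threshold and pack one row. Bit (x % 8) of out[x / 8] is set when
   in[x] <= threshold (LSB-first order). out must hold (width + 7) / 8 bytes. */
static void threshold_pack_row(const uint8_t *in, uint8_t *out,
                               uint8_t threshold, size_t width)
{
    size_t x = 0;

    assert(in != NULL && out != NULL);
    memset(out, 0, (width + 7) / 8);

#if defined(__SSE2__)
    const __m128i vthr = _mm_set1_epi8((char)threshold);
    for (; x + 16 <= width; x += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + x));
        /* max(v, thr) == thr holds exactly when v <= thr (unsigned). */
        __m128i le = _mm_cmpeq_epi8(_mm_max_epu8(v, vthr), vthr);
        uint16_t bits = (uint16_t)_mm_movemask_epi8(le);
        memcpy(out + x / 8, &bits, sizeof bits);  /* x86 is little-endian */
    }
#endif
    /* Scalar tail (or the whole row when SSE2 is unavailable). */
    for (; x < width; x++)
        out[x >> 3] |= (uint8_t)((in[x] <= threshold) << (x & 7));
}
```

If MSB2LSB output is required, the bits of each output byte would have to be reversed afterwards, or the TIFF FillOrder set to LSB2MSB instead.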

【Discussion】: