【Posted】: 2011-04-11 22:39:00
【Question】:
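I am working on an algorithm that globally thresholds an 8-bit grayscale image into a 1-bit monochrome image (bit-packed, so that 1 byte holds 8 pixels). Each pixel in the grayscale image can have a luminance value of 0 - 255.
My environment is Win32 in Microsoft Visual Studio C++.
Out of curiosity, I am interested in optimizing the algorithm as far as possible. The 1-bit image will become a TIFF. Currently I set FillOrder to MSB2LSB (most significant bit to least significant bit) only because the TIFF spec suggests it (it does not necessarily need to be MSB2LSB).
Some background for those who don't know:
MSB2LSB orders pixels from left to right within a byte, just as pixels are oriented in an image as the X coordinate increases. If you traverse the grayscale image from left to right along the X axis, this obviously requires you to think "backwards" as you pack bits into the current byte. With that said, let me show you what I currently have (this is in C; I have not attempted ASM or compiler intrinsics yet only because I have very little experience with them, but that would be a possibility).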
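To make the fill-order distinction concrete, here is a minimal sketch that packs the same eight already-thresholded pixels both ways. The function names `pack8_msb2lsb` and `pack8_lsb2msb` are hypothetical and not part of the original code; the input is assumed to be an array of 0/1 values.

```c
/* Pack eight 0/1 values so that the first (leftmost) pixel lands in bit 7.
   This is the MSB2LSB fill order described above. */
static unsigned char pack8_msb2lsb(const unsigned char bits[8])
{
    unsigned b = 0;
    for (int i = 0; i < 8; i++)
        b |= (bits[i] & 1u) << (7 - i);
    return (unsigned char)b;
}

/* Same input, but the first pixel lands in bit 0 (LSB2MSB fill order). */
static unsigned char pack8_lsb2msb(const unsigned char bits[8])
{
    unsigned b = 0;
    for (int i = 0; i < 8; i++)
        b |= (bits[i] & 1u) << i;
    return (unsigned char)b;
}
```

For the pixel run 1,1,0,0,0,0,0,1 the MSB2LSB byte is 0xC1 (binary 11000001, which reads like the pixels from left to right), while the LSB2MSB byte is 0x83.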
Because the monochrome image has 8 pixels per byte, the width of the monochrome image in bytes will be
(grayscaleWidth+7)/8;
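As a quick check of the rounding in that formula, the sketch below wraps it in a helper. The name `packed_stride` is an assumption made for illustration; adding 7 before the integer division rounds up, so a partially filled last byte still gets allocated.

```c
#include <stddef.h>

/* Bytes needed for one bit-packed row: ceil(gray_width / 8) using
   integer arithmetic only. */
static size_t packed_stride(size_t gray_width)
{
    return (gray_width + 7) / 8;
}
```

For example, a width of 8 needs 1 byte, a width of 9 needs 2 bytes, and the assumed maximum width of 6000 needs 750 bytes.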
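FYI, I assume my largest image is 6000 pixels wide:
The first thing I do (before any image is processed) is
1) compute a lookup table of the amount I need to shift into a specific byte, given an X coordinate in the grayscale image: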
int _shift_lut[6000];
for( int x = 0 ; x < 6000; x++)
{
_shift_lut[x] = 7-(x%8);
}
With this lookup table, I can pack a monochrome bit value into the current byte I am working on:
monochrome_pixel |= 1 << _shift_lut[ grayX ];
This ends up being a huge speed increase over doing
monochrome_pixel |= 1 << ( 7-(grayX%8) );
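For comparison, the shift can also be computed without a table and without the signed `%` operator. This is only a sketch of the alternative the commenters describe, not a measured result; `msb_first_shift` is a hypothetical name, and it assumes the X coordinate is held in an unsigned type.

```c
/* Shift amount for pixel column x under MSB-first fill order.
   For unsigned x, (x & 7u) equals x % 8, so this matches the
   _shift_lut[x] = 7-(x%8) table entry for every non-negative x. */
static unsigned msb_first_shift(unsigned x)
{
    return 7u - (x & 7u);
}
```

Whether this beats the table on a given compiler and CPU has to be measured; the timings reported later in the question show that the answer is not obvious.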
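The second lookup table I compute tells me the X index (in bytes) into my monochrome image, given the X pixel in the grayscale image. This very simple LUT is computed like so: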
int xOffsetLut[6000];
int element_size=8; //8 bits
for( int x = 0; x < 6000; x++)
{
xOffsetLut[x]=x/element_size;
}
This LUT allows me to do things like
monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels
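The same byte index can be derived directly from the X coordinate. The sketch below sets a single pixel in an MSB-first packed row using a shift for the byte index and a mask for the bit position; `set_packed_pixel` is a hypothetical helper, and an unsigned coordinate is assumed so that `x >> 3` is exactly `x / 8`.

```c
#include <stddef.h>

/* Set the bit for column x in an MSB-first bit-packed row.
   x >> 3 selects the byte (same value as xOffsetLut[x]);
   0x80 >> (x & 7) selects the bit inside that byte. */
static void set_packed_pixel(unsigned char *row, size_t x)
{
    row[x >> 3] |= (unsigned char)(0x80u >> (x & 7u));
}
```

Setting columns 0 and 7 yields 0x81 in the first byte; setting column 9 yields 0x40 in the second byte.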
My grayscale image is a simple unsigned char*, and so is my monochrome image;
this is how I initialize the monochrome image:
int bitPackedScanlineStride = (grayscaleWidth+7)/8;
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight;
unsigned char * bitpack_image = new unsigned char[bitpackedLength];
memset(bitpack_image,0,bitpackedLength);
Then I call my binarize function like so:
binarize(
gray_image.DataPtr(),
bitpack_image,
globalFormThreshold,
grayscaleWidth,
grayscaleHeight,
bitPackedScanlineStride,
bitpackedLength,
_shift_lut,
xOffsetLut);
Here is my binarize function (as you can see, I did some loop unrolling, which may or may not help).
void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] )
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth;
int checkY=grayscaleHeight;
for ( int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for( int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
I know this algorithm could potentially miss a few trailing pixels in each row, but don't worry about that.
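For reference, a plain (not unrolled) version that does handle the trailing pixels might look like the sketch below. It is an illustration under assumptions, not the author's code: `binarize_ref` is a hypothetical name, the rows are assumed to be tightly packed (`width` bytes per grayscale row), and unused bits in the last byte of each row are left as 0.

```c
#include <stddef.h>

/* Threshold an 8-bit grayscale image into an MSB-first bit-packed image.
   A pixel sets its bit when pel <= threshold, as in the function above. */
static void binarize_ref(const unsigned char *gray, unsigned char *packed,
                         int threshold, size_t width, size_t height)
{
    const size_t stride = (width + 7) / 8;
    for (size_t y = 0; y < height; y++) {
        const unsigned char *src = gray + y * width;
        unsigned char *dst = packed + y * stride;
        for (size_t x = 0; x < width; x += 8) {
            /* The last byte of a row may cover fewer than 8 pixels. */
            size_t n = (width - x < 8) ? (width - x) : 8;
            unsigned b = 0;
            for (size_t i = 0; i < n; i++)
                b |= (unsigned)(src[x + i] <= threshold) << (7 - i);
            *dst++ = (unsigned char)b;
        }
    }
}
```

A slow but complete version like this is also handy as a correctness baseline when comparing the output of faster variants.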
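As you can see, for each monochrome byte I process 8 grayscale pixels.
Where you see (pel8<=threshold), the comparison itself yields 0 or 1, which is much faster than branching on the result.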
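The `(pel<=threshold)` expressions in the function above rely on the fact that a C relational operator yields an `int` that is exactly 0 or 1. The sketch below puts the branching and the branch-free forms side by side; both function names are hypothetical.

```c
/* Explicit branch: returns 1 when the pixel is at or below the threshold. */
static unsigned char bit_branchy(unsigned char pel, int threshold)
{
    if (pel <= threshold)
        return 1;
    else
        return 0;
}

/* Branch-free form: the comparison result (0 or 1) is used directly. */
static unsigned char bit_branchless(unsigned char pel, int threshold)
{
    return (unsigned char)(pel <= threshold);
}
```

Both return the same values. Whether the second form is actually faster depends on the compiler, since an optimizer may well emit identical code for the two.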
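For each increment of X, I pack a bit into the next lower-order bit than for the previous X (the first pixel of each group lands in bit 7, the MSB).
So for the first set of 8 pixels in the grayscale image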
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
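this is what the bits in the byte look like (obviously each numbered bit is just the threshold result of the correspondingly numbered pixel, but you get the idea)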
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
PHEW, that should be it. Feel free to have fun with some nifty bit-twiddling tricks that would squeeze more juice out of this algorithm.
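One small trick along these lines, offered only as a sketch and not benchmarked here, is to shift an accumulator left by one for every pixel instead of shifting each comparison result by a position-dependent amount. After eight pixels the first pixel has naturally ended up in bit 7. The name `pack8_accumulate` is hypothetical.

```c
/* Pack 8 consecutive grayscale pixels into one MSB-first byte.
   Each step moves the bits gathered so far one place left and puts
   the new comparison result (0 or 1) into bit 0. */
static unsigned char pack8_accumulate(const unsigned char *src, int threshold)
{
    unsigned b = 0;
    for (int i = 0; i < 8; i++)
        b = (b << 1) | (unsigned)(src[i] <= threshold);
    return (unsigned char)b;
}
```

This needs neither a shift table nor per-pixel shift constants, but each step depends on the previous one, so it may or may not run faster than the independent constant shifts.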
With compiler optimizations turned on, this function takes 16 milliseconds on average for an image of roughly 5000 x 2200 pixels on a Core 2 Duo machine.
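EDIT:
R..'s suggestion was to remove the shift LUT and just use constants, which is actually perfectly logical... I have modified the OR'ing of each pixel to be as such: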
void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] )
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth-32;
int checkY=grayscaleHeight;
for ( int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for( int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
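I am now testing on an Intel Xeon 5670 using (GCC) 4.1.2. Under these specs, the hard-coded bit shift is about 4 ms slower than my original LUT algorithm. On the Xeon with GCC, the LUT algorithm takes 8.61 ms on average and the hard-coded bit shift takes 12.285 ms on average.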
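The post does not show how these averages were measured. A minimal way to take such a measurement, given purely as an assumption about one possible setup, is to average the CPU time reported by the standard `clock()` function over many calls. `average_ms` and `count_call` are hypothetical names; `count_call` exists only to exercise the harness.

```c
#include <time.h>

/* Average CPU time in milliseconds of fn(ctx) over reps calls.
   reps must be greater than zero. */
static double average_ms(void (*fn)(void *), void *ctx, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(ctx);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / reps;
}

/* Placeholder workload: counts how many times it was called. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

In a real comparison, `fn` would wrap one call to each `binarize` variant on the same input, with enough repetitions to smooth out the coarse resolution of `clock()`.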
【Comments】:
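-
Your lookup tables are useless. Simply computing the shift (if you do it correctly, rather than using the % operator on signed integers, which is very slow) is much faster than a lookup table. Or, better yet, you could unroll the loop and hard-code the 8 shifts. Constant shifts are usually much faster than variable shifts, so it would help a lot.
-
I have modified the algorithm to simply use constant bit shifts... it actually ends up being about 4 ms slower than the LUT. I am now using an Intel Xeon with GCC 4.1.2. The algorithm with the LUT takes 8.61 ms on average, and the one without the LUT takes 12.285 ms on average.
-
+1 to R.. The second LUT is equally useless, because x/8 will become x>>3, which is faster than *(lut+x) since you don't need to dereference a pointer. If you really think crazy portability is worthwhile (and is not already ruled out by the other constructs you are using), you could use x/CHAR_BIT.
-
@alssandro, that doesn't sound right. Can you post the code you used to get 8.61 and 12.285?
-
x/8 won't become x>>3 unless x is unsigned, or the compiler can determine that x will never be negative.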
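The signed-versus-unsigned point in the comments can be seen with a negative input. In C99 and later, signed division truncates toward zero and `%` takes the sign of the dividend, so neither can be replaced by a bare shift or mask unless the value is known to be non-negative. The helper names below are hypothetical.

```c
/* Signed: -1 / 8 is 0 (truncation toward zero). Right-shifting a negative
   value is implementation-defined, so the compiler cannot emit a bare
   shift here without extra fix-up code. */
static int byte_index_div(int x)
{
    return x / 8;
}

/* Unsigned: x >> 3 is exactly x / 8. */
static unsigned byte_index_shift(unsigned x)
{
    return x >> 3;
}

/* Signed: -1 % 8 is -1, not 7, so it is not a valid bit position. */
static int bit_in_byte_mod(int x)
{
    return x % 8;
}

/* Unsigned: x & 7 is exactly x % 8. */
static unsigned bit_in_byte_mask(unsigned x)
{
    return x & 7u;
}
```

Declaring the loop index and the LUT index as `unsigned` (or `size_t`) is therefore the simplest way to let the compiler use the cheap forms.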
Tags: c image-processing optimization