快速计算 __m128i 寄存器中设置的位数答案

【问题标题】：Fast counting the number of set bits in __m128i register快速计算 __m128i 寄存器中设置的位数
【发布时间】：2013-06-25 15:39:20
【问题描述】：

我应该计算 __m128i 寄存器的设置位数。特别是，我应该使用以下方法编写两个能够计算寄存器位数的函数。

寄存器的设置位总数。
寄存器每个字节的设置位数。

是否存在可以全部或部分执行上述操作的内在函数？

【问题讨论】：

最近的 CPU 有一个POPCNT（人口计数）指令； GCC 通过内置的__builtin_popcount 公开它。
请参阅 graphics.stanford.edu/~seander/bithacks.html 了解更多信息。
MS 也有 popcount 函数...见stackoverflow.com/questions/11114017/… ...请注意，这些不一定比 bithacks 快；如果对数组中的位进行计数，一些 bithack 函数会更快。

标签： c sse simd sse2 hammingweight

【解决方案1】：

编辑：我想我不明白 OP 在寻找什么，但我会保留我的答案，以防其他人遇到此问题时有用。

C 提供了一些不错的按位运算。

这是计算整数中设置的位数的代码：

countBitsSet(int toCount)
{
    int numBitsSet = 0;
    while(toCount != 0)
    {
        count += toCount % 2;
        toCount = toCount >> 1;
    }
    return numBitsSet;
}

解释：

toCount % 2

返回整数的最后一位。（除以二并检查余数）。我们将此添加到我们的总计数中，然后将我们的 toCount 值的位移动一位。这个操作应该一直持续到 toCount 中没有更多的位被设置（当 toCount 等于 0 时）

要计算特定字节中的位数，您需要使用掩码。这是一个例子：

countBitsInByte(int toCount, int byteNumber)
{
    int mask = 0x000F << byteNumber * 8
    return countBitsSet(toCount & mask)
}

假设在我们的系统中，我们认为字节 0 是小端系统中的最低有效字节。我们想要创建一个新的 toCount 以通过屏蔽设置为 0 的位来传递给我们之前的 countBitsSet 函数。我们通过将一个满为 1 的字节（由字母 F 表示）移动到我们想要的位置（byteNumber * 8 表示一个字节中的 8 位）并使用我们的 toCount 变量执行按位与运算。

【讨论】：

有有内置函数（映射到 CPU 指令的内在函数，例如 POPCNT），问题是关于计算 128 位 SSE (XMM) 寄存器中的设置位，而不是int。
啊，我知道我没有完全理解这个问题。如果合适的话，我会编辑我的回复并保持它，以防它对遇到此问题的人有所帮助。
C 不提供“不错”的按位运算。你甚至不能便携地得到算术右移！实现可以是 2 的补码，但有符号类型上的 >> 是逻辑移位。但在实践中，人们实际想要使用的所有编译器都会为您提供有符号类型的算术右移，因此您的函数是负 toCount 的无限循环。并且签名%2 比&1 需要更多的工作，因为它必须为负奇数生成-1。但是（在普通编译器上）如果toCount 为负数，您的函数将永远不会返回，因此该问题被隐藏了......

【解决方案2】：

这是我在旧项目 (there is a research paper about it) 中使用的一些代码。下面的函数popcnt8 计算每个字节中设置的位数。

仅 SSE2 版本（基于 Hacker's Delight book 中的算法 3）：

static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
static inline __m128i popcnt8(__m128i x) {
    __m128i n;
    // Count bits in each 4-bit field.
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}

SSSE3 版本（由于Wojciech Mula）：

static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static inline __m128i popcnt8(__m128i n) {
    const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(n, popcount_mask));
    const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(n, 4), popcount_mask));
    return _mm_add_epi8(pcnt0, pcnt1);
}

XOP 版本（相当于 SSSE3，但使用在 AMD Bulldozer 上更快的 XOP 指令）

static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static const __m128i popcount_shift = _mm_set1_epi8(-4);
static inline __m128i popcount8(__m128i n) {
    const __m128i pcnt0 = _mm_perm_epi8(popcount_table, popcount_table, _mm_and_si128(n, popcount_mask));
    const __m128i pcnt1 = _mm_perm_epi8(popcount_table, popcount_table, _mm_shl_epi8(n, popcount_shift));
    return _mm_add_epi8(pcnt0, pcnt1);
}

下面的函数popcnt64统计SSE寄存器低64位和高64位的位数：

SSE2 版本：

static inline __m128i popcnt64(__m128i n) {
    const __m128i cnt8 = popcnt8(n);
    return _mm_sad_epu8(cnt8, _mm_setzero_si128());
}

XOP 版本：

static inline __m128i popcnt64(__m128i n) {
    const __m128i cnt8 = popcnt8(n);
    return _mm_haddq_epi8(cnt8);
}

最后，下面的函数popcnt128统计整个128位寄存器的位数：

static inline int popcnt128(__m128i n) {
    const __m128i cnt64 = popcnt64(n);
    const __m128i cnt64_hi = _mm_unpackhi_epi64(cnt64, cnt64);
    const __m128i cnt128 = _mm_add_epi32(cnt64, cnt64_hi);
    return _mm_cvtsi128_si32(cnt128);
}

但是，实现popcnt128 的更有效方法是使用硬件 POPCNT 指令（在支持它的处理器上）：

static inline int popcnt128(__m128i n) {
    const __m128i n_hi = _mm_unpackhi_epi64(n, n);
    #ifdef _MSC_VER
        return __popcnt64(_mm_cvtsi128_si64(n)) + __popcnt64(_mm_cvtsi128_si64(n_hi));
    #else
        return __popcntq(_mm_cvtsi128_si64(n)) + __popcntq(_mm_cvtsi128_si64(n_hi));
    #endif
}

【讨论】：

看来您是上述研究论文的合著者之一 :-) 对剪切粘贴工作人员的总结也不错。您的解决方案是最新的。 Hakem 技巧不再是最新的了。致敬，伙计！
哦，太糟糕了。你在 ACM 上发表了你的论文，所以很遗憾，我不支付 15 美元就无法阅读它：-(
@NilsPipenbrinck，该论文可在会议网站上免费获取：conferences.computer.org/sc/2012/papers/1000a033.pdf
显然，您的 SSE2 版本通常比您的 SSSE3 版本快。 SSSE3 的指令越少也没关系。这是一个基准：github.com/Const-me/LookupTables
@Sonts 可能是这样，但仅来自 Microsoft 编译器的结果并不能令人信服。

【解决方案3】：

正如第一条评论中所说，gcc 3.4+ 提供了通过

轻松访问（希望是最佳的）内置的

int __builtin_popcount (unsigned int x) /* Returns the number of 1-bits in x. */

如此处所述： http://gcc.gnu.org/onlinedocs/gcc-3.4.3/gcc/Other-Builtins.html#Other%20Builtins

不完全回答 128 位的问题，但对我到达这里时遇到的问题给出一个很好的答案:)

【讨论】：

【解决方案4】：

这是一个基于 Bit Twiddling Hacks - Counting Set Bits in Parallel 的版本，其命名类似于其他内在函数以及 16 个 32 位和 64 位向量的一些额外函数

#include "immintrin.h"

/* bit masks: 0x55 = 01010101, 0x33 = 00110011, 0x0f = 00001111 */
static const __m128i m1 = {0x5555555555555555ULL,0x5555555555555555ULL};
static const __m128i m2 = {0x3333333333333333ULL,0x3333333333333333ULL};
static const __m128i m3 = {0x0f0f0f0f0f0f0f0fULL,0x0f0f0f0f0f0f0f0fULL};
static const __m128i m4 = {0x001f001f001f001fULL,0x001f001f001f001fULL};
static const __m128i m5 = {0x0000003f0000003fULL,0x0000003f0000003fULL};

__m128i _mm_popcnt_epi8(__m128i x) {
    /* Note: if we returned x here it would be like _mm_popcnt_epi1(x) */ 
    __m128i y;
    /* add even and odd bits*/
    y = _mm_srli_epi64(x,1);  //put even bits in odd place
    y = _mm_and_si128(y,m1);  //mask out the even bits (0x55)
    x = _mm_subs_epu8(x,y);   //shortcut to mask even bits and add
    /* if we just returned x here it would be like _mm_popcnt_epi2(x) */ 
    /* now add the half nibbles */
    y = _mm_srli_epi64 (x,2); //move half nibbles in place to add
    y = _mm_and_si128(y,m2);  //mask off the extra half nibbles (0x0f)
    x = _mm_and_si128(x,m2);  //ditto
    x = _mm_adds_epu8(x,y);   //totals are a maximum of 5 bits (0x1f)
    /* if we just returned x here it would be like _mm_popcnt_epi4(x) */ 
    /* now add the nibbles */
    y = _mm_srli_epi64(x,4);  //move nibbles in place to add
    x = _mm_adds_epu8(x,y);   //totals are a maximum of 6 bits (0x3f)
    x = _mm_and_si128(x,m3);  //mask off the extra bits
    return x;
}

__m128i _mm_popcnt_epi16(__m128i x) {
    __m128i y;
    x = _mm_popcnt_epi8(x);    //get byte popcount
    y = _mm_srli_si128(x,1);   //copy even bytes for adding
    x = _mm_add_epi16(x,y);    //add even bytes into the odd bytes
    return _mm_and_si128(x,m4);//mask off the even byte and return
}

__m128i _mm_popcnt_epi32(__m128i x) {
    __m128i y;
    x = _mm_popcnt_epi16(x);   //get word popcount
    y = _mm_srli_si128(x,2);   //copy even words for adding
    x = _mm_add_epi32(x,y);    //add even words into odd words
    return _mm_and_si128(x,m5);//mask off the even words and return
}

__m128i _mm_popcnt_epi64(__m128i x){
    /* _mm_sad_epu8() is weird
       It takes the absolute difference of bytes between 2 __m128i
       then horizontal adds the lower and upper 8 differences
       and stores the sums in the lower and upper 64 bits
    */
    return _mm_sad_epu8(_mm_popcnt_epi8(x),(__m128i){0});
}

int _mm_popcnt_si128(__m128i x){
    x = _mm_popcnt_epi64(x);
    __m128i y = _mm_srli_si128(x,8);
    return _mm_add_epi64(x,y)[0];
    //alternative: __builtin_popcntll(x[0])+__builtin_popcntll(x[1]);
}

【讨论】：

为什么在第一步之后的步骤中需要饱和 adds 而不是常规的 add？（尽管根据 Agner Fog 的指令表，paddusb 在所有方面的性能都与paddb 相同，因此没有理由避免饱和添加。这令人惊讶。）