如何使用布尔数组作为 AVX 掩码加载的掩码？答案

【问题标题】：How to use boolean array as mask for AVX maskload?如何使用布尔数组作为 AVX 掩码加载的掩码？
【发布时间】：2020-11-19 13:58:25
【问题描述】：

我有一个浮点数组和一个布尔数组，其中布尔数组中具有相应真值的所有浮点数需要相加。我考虑过使用_mm256_maskload_pd 加载每个浮点向量，然后将它们与累加器相加，然后在最后进行水平求和。但是，我不确定如何使布尔数组与此操作所需的 __m256i 掩码类型一起使用。

我对使用 SIMD/AVX 非常陌生，所以我不确定我是否走错了方向。使用 512 位 AVX 也很好，但我没有找到足够的指令似乎对此有用。

我的目标（没有 SIMD）是这个（Rust 代码，但对于答案，我对内在函数更感兴趣，所以 C(++) 很好）：

let mask: [bool] = ...;
let floats: [f64] = ...;

let sum = 0.0;
for (val, cond) in floats.zip(mask) {
    if cond {
        sum += val;
    }
}

【问题讨论】：

AVX512 让它变得微不足道； masking 是 AVX-512 的核心部分，可以直接使用位向量。 AVX 必须将掩码扩展为与maskload 或vandpd 或其他任何内容一起使用的向量。（_mm256_maskload_pd 可能是个不错的选择）
我认为这是 AVX/AVX2 部分的 is there an inverse instruction to the movemask instruction in intel avx2? 的副本。
使用 AVX512，您可以从 mask 数组中读取 __mask8 值以用于 _mm512_maskz_loadu_pd( __mmask8 k, void * s)，即 1 个字节中的 8 位块。或者在普通加载后合并屏蔽vaddpd，这可能有助于编译器折叠内存操作数。或者你有一个实际的bool 的解包数组，每个字节有一个bool？如果是这样，不要为 AVX512 那样做！使用 AVX2 更容易解包，但对 AVX-512 来说效率较低。
是的，你会在最后使用向量累加器和 hsum。或者对于更大的阵列，使用多个累加器来隐藏延迟瓶颈。例如对于 L1d 高速缓存中热的大型阵列，理论上 8 个向量累加器可以跟上vaddpd zmm0{k1}, zmm0, [rdi] 4c 延迟，0.5c 吞吐量来执行 2 个向量加载+每个时钟添加。（虽然在实践中你会在端口 5 上将数据移动到掩码寄存器中遇到瓶颈：/）
是的，使用 x86 SIMD 创建打包掩码非常容易（movmskps/pd 和 pmovmskb 自 SSE1 以来就存在），并且使用 AVX-512 使用它们也变得高效。

标签： simd avx avx2 avx512

【解决方案1】：

这是 C++/17 中的 AVX2 版本，未经测试。

#include <immintrin.h>
#include <stdint.h>
#include <array>
#include <assert.h>
using Accumulators = std::array<__m256d, 4>;

// Conditionally accumulate 4 lanes, with source data from a register
template<int i>
inline void conditionalSum4( Accumulators& acc, __m256d vals, __m256i broadcasted )
{
    // Unless the first register in the batch, shift mask values by multiples of 4 bits
    if constexpr( i > 0 )
        broadcasted = _mm256_srli_epi64( broadcasted, i * 4 );
    // Bits 0-3 in 64-bit lanes of `broadcasted` correspond to the values being processed

    // Compute mask from the lowest 4 bits of `broadcasted`, each lane uses different bit
    const __m256i bits = _mm256_setr_epi64x( 1, 2, 4, 8 );
    __m256i mask = _mm256_and_si256( broadcasted, bits );
    mask = _mm256_cmpeq_epi64( mask, bits );    // Expand bits into qword-s

    // Bitwise AND to zero out masked out lanes: integer zero == double 0.0
    // BTW, if your mask has 1 for values you want to ignore, _mm256_andnot_pd
    vals = _mm256_and_pd( vals, _mm256_castsi256_pd( mask ) );

    // Accumulate the result
    acc[ i ] = _mm256_add_pd( acc[ i ], vals );
}

// Conditionally accumulate 4 lanes, with source data from memory
template<int i>
inline void conditionalSum4( Accumulators& acc, const double* source, __m256i broadcasted )
{
    constexpr int offset = i * 4;
    const __m256d vals = _mm256_loadu_pd( source + offset );
    conditionalSum4<i>( acc, vals, broadcasted );
}

// Conditionally accumulate lanes from memory, for the last potentially incomplete vector
template<int i>
inline void conditionalSumPartial( Accumulators& acc, const double* source, __m256i broadcasted, size_t count )
{
    constexpr int offset = i * 4;
    __m256d vals;
    __m128d low, high;
    switch( count - offset )
    {
    case 1:
        // Load a scalar, zero out other 3 lanes
        vals = _mm256_setr_pd( source[ offset ], 0, 0, 0 );
        break;
    case 2:
        // Load 2 lanes
        low = _mm_loadu_pd( source + offset );
        high = _mm_setzero_pd();
        vals = _mm256_setr_m128d( low, high );
        break;
    case 3:
        // Load 3 lanes
        low = _mm_loadu_pd( source + offset );
        high = _mm_load_sd( source + offset + 2 );
        vals = _mm256_setr_m128d( low, high );
        break;
    case 4:
        // The count of values was multiple of 4, load the complete vector
        vals = _mm256_loadu_pd( source + offset );
        break;
    default:
        assert( false );
        return;
    }
    conditionalSum4<i>( acc, vals, broadcasted );
}

// The main function implementing the algorithm.
// maskBytes argument is densely packed mask values with 1 bit per double, the size must be ( ( count + 7 ) / 8 )
// Each byte of the mask packs 8 boolean values, the first value of the byte is in the least significant bit.
double conditionalSum( const double* source, const uint8_t* maskBytes, size_t count )
{
    // Zero-initialize all 4 accumulators
    std::array<__m256d, 4> acc;
    acc[ 0 ] = acc[ 1 ] = acc[ 2 ] = acc[ 3 ] = _mm256_setzero_pd();

    // The main loop consumes 16 scalars, and 16 bits of the mask, per iteration
    for( ; count >= 16; source += 16, maskBytes += 2, count -= 16 )
    {
        // Broadcast 16 bits of the mask from memory into AVX register
        // Technically, C++ standard says casting pointers like that is undefined behaviour.
        // Works fine in practice; alternatives exist, but they compile into more than 1 instruction.
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );

        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSum4<2>( acc, source, broadcasted );
        conditionalSum4<3>( acc, source, broadcasted );
    }

    // Now the hard part, dealing with the remainder
    // The switch argument is count of vectors in the remainder, including incomplete ones.
    switch( ( count + 3 ) / 4 )
    {
    case 0:
        // Best case performance wise, the count of values was multiple of 16
        break;
    case 1:
    {
        // Note we loading a single byte from the mask instead of 2 bytes. Same for case 2.
        const __m256i broadcasted = _mm256_set1_epi8( (char)*maskBytes );
        conditionalSumPartial<0>( acc, source, broadcasted, count );
    }
    case 2:
    {
        const __m256i broadcasted = _mm256_set1_epi8( (char)*maskBytes );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSumPartial<1>( acc, source, broadcasted, count );
        break;
    }
    case 3:
    {
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSumPartial<2>( acc, source, broadcasted, count );
        break;
    }
    case 4:
    {
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSum4<2>( acc, source, broadcasted );
        conditionalSumPartial<3>( acc, source, broadcasted, count );
        break;
    }
    }

    // At last, compute sum of the 16 accumulated values
    const __m256d r01 = _mm256_add_pd( acc[ 0 ], acc[ 1 ] );
    const __m256d r23 = _mm256_add_pd( acc[ 2 ], acc[ 3 ] );
    const __m256d sum4 = _mm256_add_pd( r01, r23 );
    const __m128d sum2 = _mm_add_pd( _mm256_castpd256_pd128( sum4 ), _mm256_extractf128_pd( sum4, 1 ) );
    const __m128d sum1 = _mm_add_sd( sum2, _mm_shuffle_pd( sum2, sum2, 0b11 ) );
    return _mm_cvtsd_f64( sum1 );
}

几个有趣的点。

我将循环展开 16 个值，并使用 4 个独立的累加器。因为流水线增加了带宽。减少循环退出检查的频率，即更多的指令用于计算有用的东西。降低广播掩码值的频率，将数据从标量单位移动到矢量单位有一些延迟。请注意，我每 16 个元素仅从掩码加载一次，并通过位移重用向量。还可以提高精度，当您将小浮点值添加到大浮点值时，精度会丢失，16 个标量累加器会有所帮助。

正确处理这些余数，无需将数据从寄存器移动到内存再返回，这很复杂，需要部分加载等。

如果你将整数值从模板参数移到普通整数参数中，代码可能会停止编译，编译器会说类似“预期的常量表达式”。原因是，许多 SIMD 指令，包括_mm256_srli_epi64，将常量编码为指令的一部分，因此编译器需要知道这些值。另一件事，数组索引需要是 constexpr ，否则编译器会将数组从 4 个寄存器驱逐到 RAM 中，以便在您访问值时能够进行指针数学运算。累加器需要一直留在寄存器中，否则性能会大幅下降，即使 L1D 缓存比寄存器慢得多。

Here’s the output of gcc。该程序集似乎是合理的，编译器已成功内联所有内容，并且主循环中唯一的内存访问是源值。主循环在.L3 标签的下方。

【讨论】：

_mm256_blendv_pd 花费 2 微秒，并将混合操作放在关键路径上（循环携带的 dep 链）。通常最好只使用 maskload 或其他东西，因此您将 0.0 添加到掩码为 false 的元素中。在最坏的情况下，在添加之前与0.0 混合，因此它不在关键路径上，并且是从内存中准备输入向量的一部分，但我认为_mm256_maskload_pd (vmaskmovpd) 在大多数 CPU 上更有效。总共只有 2 个微指令，包括 Skylake 上的负载，Zen2 上的 1 个。
@PeterCordes vmaskmovpd 在 AMD 上延迟很大，50 个周期。 Vector blend 的 2 条微指令可能比等效的多条指令要快，你需要两个，vpcmpeqq 和 vpand。
uops.info/html-lat/ZEN2/… 表明在 Zen2 上它是 uops.info/… 上启用该列。但无论如何，如果你想避免 maskload，那么你仍然可以在关键路径之外与零混合。
另外，您可以只使用 VPAND / VPCMPEQQ 来生成掩码，并将其与 VANDPD 一起使用，而不是 _mm256_sllv_epi64 + blendv。（并且仍然将向量右移以与固定的 0, 1sllv 0,1,2,3。或者只是提升一个 srlv 0,1,2,3 并在循环内使用 srli + slli 63 ？在 sllv 是多个微指令的 Haswell 上更好。
您可以获得一些可能无关紧要的 ILP（但可能有助于从掩码阵列上的缓存未命中中赶上）：使循环携带的 broadcasted dep 链为_mm256_srli_epi64 shift 3 种不同的方式（4、8、12），仅使用最后一种方式更新原始方式。所以最后一组 4 位只是 64/16 = 4 移位而不是 64/4 = 16 移位。 OTOH 在每个班次结果准备好后，还有超过 1 个周期的 ALU 工作要做，所以它可能永远不会成为瓶颈。