Load vector from large vector with SIMD based on mask

【Question title】: Load vector from large vector with SIMD based on mask
【Posted】: 2015-07-24 03:16:57
【Question description】:

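I'm hoping someone can help here.

I have a large byte vector from which I create a small byte vector (based on a mask), which I then process with SIMD.

Currently the mask is an array of baseOffset + sub-mask (byte[256]), optimised for storage because there are > 10^8 of them. I create a max-size sub-vector, then loop through the mask array, multiply the baseOffset by 256, and for each bit offset set in the mask load from the large vector and put the values sequentially into the smaller vector. The smaller vector is then processed via a number of VPMADUBSW instructions and accumulated. I can change this structure, for example by walking the bits once to fill an 8K bit-array buffer and then creating the small vector.

Is there a faster way to create the sub-array?

I pulled the code out of the application into a test program, but the original is in a state of flux (moving to AVX2 and pulling more out of C#).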

#include "stdafx.h"
#include<stdio.h>
#include <mmintrin.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <smmintrin.h>
#include <immintrin.h>


//from 
char N[4096] = { 9, 5, 5, 5, 9, 5, 5, 5, 5, 5 };
//W
char W[4096] = { 1, 2, -3, 5, 5, 5, 5, 5, 5, 5 };

char buffer[4096] ; 





__declspec(align(2))
struct packed_destination{
    char blockOffset;
    __int8   bitMask[32];

};

__m128i sum = _mm_setzero_si128();
packed_destination packed_destinations[10];



void  process128(__m128i u, __m128i s)
{
    __m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw 
    __m128i loints = _mm_cvtepi16_epi32(calc);
    __m128i hiints = _mm_cvtepi16_epi32(_mm_shuffle_epi32(calc, 0x4e));
    sum = _mm_add_epi32(_mm_add_epi32(loints, hiints), sum);
}

void process_array(char n[], char w[], int length)
{
    sum = _mm_setzero_si128();
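    int length128th  = length >> 4; // number of 16-byte (128-bit) blocks
    for (int i = 0; i < length128th; i++)
    {
        // unaligned loads: the char arrays are not declared 16-byte aligned
        __m128i u = _mm_loadu_si128((__m128i*)&n[i * 16]);
        __m128i s = _mm_loadu_si128((__m128i*)&w[i * 16]);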
        process128(u, s);
    }
}


void populate_buffer_from_vector(packed_destination packed_destinations[], char n[]  , int  dest_length)
{
    int buffer_dest_index = 0; 
    for (int i = 0; i < dest_length; i++)
    {
        int blockOffset = packed_destinations[i].blockOffset <<8 ;
        // go through mask and copy to buffer
        for (int j = 0; j < 32; j++)
        {
            // parentheses needed: + binds tighter than <<
            int joffset = blockOffset + (j << 3);
            int mask = packed_destinations[i].bitMask[j];
            // bit k of this mask byte selects byte joffset + k of the large vector
            if (mask & 1 << 0)
                buffer[buffer_dest_index++] = n[joffset + 0];
            if (mask & 1 << 1)
                buffer[buffer_dest_index++] = n[joffset + 1];
            if (mask & 1 << 2)
                buffer[buffer_dest_index++] = n[joffset + 2];
            if (mask & 1 << 3)
                buffer[buffer_dest_index++] = n[joffset + 3];
            if (mask & 1 << 4)
                buffer[buffer_dest_index++] = n[joffset + 4];
            if (mask & 1 << 5)
                buffer[buffer_dest_index++] = n[joffset + 5];
            if (mask & 1 << 6)
                buffer[buffer_dest_index++] = n[joffset + 6];
            if (mask & 1 << 7)
                buffer[buffer_dest_index++] = n[joffset + 7];
        };

    }


}

int _tmain(int argc, _TCHAR* argv[])
{
    for (int i = 0; i < 32; ++i)
    {
        packed_destinations[0].bitMask[i] = 0x0f;
        packed_destinations[1].bitMask[i] = 0x04;
    }
    packed_destinations[1].blockOffset = 1;

    populate_buffer_from_vector(packed_destinations, N, 1);
    process_array(buffer, W, 256);

    int val = sum.m128i_i32[0] +
        sum.m128i_i32[1] +
        sum.m128i_i32[2] +
        sum.m128i_i32[3];
    printf("sum is %d"  , val);
    printf("Press Any Key to Continue\n");
    getchar();
    return 0;
}

Mask usage is typically 5-15%; for some workloads it is 25-100%.

MASKMOVDQU comes close, but we would then have to repack /swl according to the mask before saving..

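【Comments】:

It would probably help if you posted your existing code.
Your process128 function looks broken - it doesn't actually use the parameters passed to it?
Fixed .. I pulled the function out in order to make an AVX2 process256.
We probably need to know how sparse the mask is. If the mask is not particularly sparse then it may be more efficient to just iterate through the large vector and mask elements out of the sum as needed (using SIMD). At the other extreme, if the mask is sufficiently sparse then you might be able to make the create_array function more efficient.
It's a sparse matrix .. and the usual col : row arrays are too expensive memory-wise, so I use masks .. Usage is typically 5-15% and occasionally 25-100%. N is up to 64K and should sit in L2. The buffer is in the 500-3000 range, so packed_destinations would ideally be 2-6 long, but although the mask will be concentrated in blocks it is not ideal, so I assume a packed_destinations length of 6-20. Creating the array is currently done in C#, and I'm hoping to remove that inefficiency.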
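
One way to read the suggestion above (walk the large vector and mask elements out of the sum) is sketched below. It rests on an assumption the question does not make: the weights are stored at the same index as their element in the large vector (the hypothetical `w_dense`), whereas the question's W is packed to line up with the buffer. The names `expand_mask16` and `masked_dot` are made up for illustration, and the target attribute is only there so GCC/Clang accept SSSE3 intrinsics without extra flags.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Expands 16 mask bits into 16 bytes: 0xFF where the bit is set, 0x00 otherwise.
TARGET_SSSE3
static __m128i expand_mask16(uint16_t bits)
{
    // Lanes 0-7 receive the low mask byte, lanes 8-15 the high mask byte.
    const __m128i byte_pick  = _mm_setr_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                             1, 1, 1, 1, 1, 1, 1, 1);
    // Lane k tests bit (k % 8) of its mask byte.
    const __m128i bit_select = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                             1, 2, 4, 8, 16, 32, 64, -128);
    __m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bits), byte_pick);
    return _mm_cmpeq_epi8(_mm_and_si128(v, bit_select), bit_select);
}

// Sum of n[i] * w_dense[i] over all i whose bit is set in bit_mask.
// ASSUMPTION: w_dense is indexed like n, and length is a multiple of 16.
TARGET_SSSE3
int32_t masked_dot(const uint8_t* n, const int8_t* w_dense,
                   const uint8_t* bit_mask, size_t length)
{
    assert(length % 16 == 0);
    const __m128i ones = _mm_set1_epi16(1);
    __m128i sum = _mm_setzero_si128();
    for (size_t i = 0; i < length; i += 16) {
        const uint16_t bits = static_cast<uint16_t>(
            bit_mask[i / 8] | (bit_mask[i / 8 + 1] << 8));
        if (bits == 0)
            continue;  // nothing selected in these 16 bytes
        __m128i u = _mm_loadu_si128(reinterpret_cast<const __m128i*>(n + i));
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w_dense + i));
        u = _mm_and_si128(u, expand_mask16(bits));  // zero the unselected bytes
        __m128i calc = _mm_madd_epi16(_mm_maddubs_epi16(u, s), ones);
        sum = _mm_add_epi32(sum, calc);
    }
    // Horizontal add of the four 32-bit lanes.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4e));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xb1));
    return _mm_cvtsi128_si32(sum);
}
```

Whether this beats building the packed buffer depends on the mask density: at 5-15% usage most of the multiply work is spent on zeroed bytes, so it is more likely to pay off for the 25-100% workloads.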

Tags: c++11 simd avx avx2

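【Solution 1】:

Some optimisations for the existing code:

If your data is sparse then it would probably be a good idea to test each 8-bit mask value as a whole before testing its individual bits, i.e.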

        int mask = packed_destinations[i].bitMask[j];
        if (mask != 0)
        {
            if (mask & 1 << 0)
                buffer[buffer_dest_index++] = n[joffset + 0];
            if (mask & 1 << 1)
                buffer[buffer_dest_index++] = n[joffset + 1];
            ...
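
For reference, a self-contained version of the loop with the extra zero test might look like the sketch below. `PackedDestination` and `gather_selected` are hypothetical names, and the layout assumed here (bit k of mask byte j selects byte j*8 + k of the 256-byte block) is an interpretation of the question rather than something it states outright.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Mirrors the question's struct: 32 mask bytes cover one 256-byte block.
struct PackedDestination {
    uint8_t blockOffset;  // index of the 256-byte block in the large vector
    uint8_t bitMask[32];  // bit k of bitMask[j] selects byte j*8 + k of the block
};

// Copies the selected bytes of n into out, back to back, and returns the count.
// The caller must make out large enough for every set bit.
size_t gather_selected(const PackedDestination* dests, size_t dest_count,
                       const uint8_t* n, uint8_t* out)
{
    assert(n != nullptr && out != nullptr);
    size_t out_index = 0;
    for (size_t i = 0; i < dest_count; ++i) {
        const size_t block_base = static_cast<size_t>(dests[i].blockOffset) << 8;
        for (size_t j = 0; j < 32; ++j) {
            const unsigned mask = dests[i].bitMask[j];
            if (mask == 0)
                continue;  // sparse case: skip all eight bit tests at once
            const size_t byte_base = block_base + (j << 3);
            for (unsigned k = 0; k < 8; ++k) {
                if (mask & (1u << k))
                    out[out_index++] = n[byte_base + k];
            }
        }
    }
    return out_index;
}
```

With usage around 5-15%, many mask bytes are likely to be zero, so the early `continue` skips eight bit tests for each of them.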

Secondly, your process128 function can be optimised considerably:

inline __m128i process128(const __m128i u, const __m128i s, const __m128i sum)
{
    const __m128i vk1 = _mm_set1_epi16(1);
    __m128i calc = _mm_maddubs_epi16(u, s);
    calc = _mm_madd_epi16(calc, vk1);
    return _mm_add_epi32(sum, calc);
}

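Note that as well as reducing the number of SSE instructions from 6 to 3, I've also made sum a parameter, to get away from any dependency on global variables (it's always a good idea to avoid globals, not only for good software engineering, but also because they can inhibit certain compiler optimisations).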
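
As a rough illustration of how the three-instruction version fits into a complete loop, here is a minimal sketch that also does the final horizontal add. The function name `dot_u8_s8` is hypothetical, the loads are unaligned on the assumption that the arrays are not 16-byte aligned, and the target attribute is only there so GCC/Clang accept SSSE3 intrinsics without extra flags.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Dot product of unsigned bytes n[] with signed bytes w[].
// length must be a multiple of 16 in this sketch.
TARGET_SSSE3
int32_t dot_u8_s8(const uint8_t* n, const int8_t* w, size_t length)
{
    assert(length % 16 == 0);
    const __m128i ones = _mm_set1_epi16(1);
    __m128i sum = _mm_setzero_si128();
    for (size_t i = 0; i < length; i += 16) {
        __m128i u = _mm_loadu_si128(reinterpret_cast<const __m128i*>(n + i));
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i));
        // pmaddubsw: u8 * s8, adjacent pairs added into 8 x int16.
        // Note that this addition saturates to the int16 range.
        __m128i calc = _mm_maddubs_epi16(u, s);
        // pmaddwd with all ones: adjacent int16 pairs added into 4 x int32.
        calc = _mm_madd_epi16(calc, ones);
        sum = _mm_add_epi32(sum, calc);
    }
    // Horizontal add of the four 32-bit lanes.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4e));  // swap 64-bit halves
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xb1));  // swap adjacent lanes
    return _mm_cvtsi128_si32(sum);
}
```

Keeping the accumulator in a local `__m128i` and reducing it once after the loop avoids touching a global on every iteration, which is the point made about globals above.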

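It would be interesting to see a profile of your code (using a decent sampling profiler, not via instrumentation), as this would help to prioritise any further optimisation efforts.

【Discussion】:

Thanks Paul. I didn't realise globals inhibited optimisation. I was also wondering about the trade-off between the existing vector population and some sort of masked move into a register, then removing the gaps via _mm256_shuffle_epi8 .. 1000 single-byte moves make me nervous, but it should all be in level 1 cache. I will profile, but I have a lot of code to rewrite and give some structure. The current code is 100% C#, and I am converting the heavy work to C. There is no absolute performance target, but the faster it is the more neurons I can use, which improves accuracy. If I can go from 70% to 85% it will be a huge win.
OK - come back when you've tried things out and post a new question with your latest code, benchmarks and profile, and we can see what further optimisations may be possible.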
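
The "load into a register, then remove the gaps with a byte shuffle" idea raised in the discussion could be sketched as follows. This is not code from the thread: it uses the 128-bit `_mm_shuffle_epi8` rather than `_mm256_shuffle_epi8` to keep the example short, it handles one mask byte (8 source bytes) per call, and `PackTable` and `pack8` are hypothetical names. Each call always writes 8 bytes, so the destination is assumed to have at least 8 writable bytes at `dst`.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// For every 8-bit mask: the indices of its set bits, packed to the front.
// Unused slots hold 0x80, which makes pshufb write a zero byte.
struct PackTable {
    uint8_t shuffle[256][8];
    uint8_t count[256];  // number of set bits in each mask
    PackTable()
    {
        for (int m = 0; m < 256; ++m) {
            int c = 0;
            for (int k = 0; k < 8; ++k)
                if (m & (1 << k))
                    shuffle[m][c++] = static_cast<uint8_t>(k);
            count[m] = static_cast<uint8_t>(c);
            while (c < 8)
                shuffle[m][c++] = 0x80;
        }
    }
};

// Packs the bytes of src[0..7] selected by mask to the front of dst.
// Writes 8 bytes to dst regardless; returns how many of them are valid.
TARGET_SSSE3
size_t pack8(const uint8_t* src, uint8_t mask, uint8_t* dst, const PackTable& table)
{
    assert(src != nullptr && dst != nullptr);
    const __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    const __m128i idx =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(table.shuffle[mask]));
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(v, idx));
    return table.count[mask];
}
```

A caller would advance its output pointer by the returned count after each mask byte, so later calls overwrite the zero padding left by earlier ones. Whether this is faster than the branchy byte copies is something only profiling on the real data would show.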