How to copy bytes into the xmm0 register (answers)

【Question title】: how to copy bytes into xmm0 register
【Posted】: 2017-01-25 23:08:58
【Question description】:

I have the following code, which works fine, but it seems inefficient given that the end result only needs the data in xmm0.

         mov rcx, 16                       ; get first word, up to 16 bytes
         mov rdi, CMD                      ; ...and put it in CMD
         mov rsi, CMD_BLOCK
 @@:     lodsb
         cmp al, 0x20
         je @f
         stosb
         loop @b

 @@:     mov rsi, CMD                      ;
         movdqa xmm0, [rsi]                ; mov cmd into xmm0

I'm sure that with SSE2, SSE4, etc. there is a better way that doesn't need the CMD buffer, but I'm struggling to work out how to do it.

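【Question comments】:

Do the upper bytes of the vector need to be zero? I.e., are you zeroing CMD before the loop?
Can you elaborate on what you're trying to do? The requirements here are unclear. What are your inputs and your goal?

Tags: assembly x86 sse sse2 sse4

【Solution 1】: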

Your code looks like it takes bytes from CMD_BLOCK up to the first 0x20, and I'm assuming you want zeros above that.

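That's not even close to the most efficient way to write a byte-at-a-time copy loop. Never use the LOOP instruction unless you're specifically tuning for one of the few architectures where it's not slow (e.g. AMD Bulldozer). See Agner Fog's guides and the other links from the x86 tag wiki. Or use SSE/AVX via C intrinsics and let the compiler generate the actual asm.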

But more importantly, you don't even need a loop if you use SSE instructions.

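I'm assuming you zeroed the 16B CMD buffer before starting the copy; otherwise you might as well just do an unaligned load and take whatever garbage bytes lie beyond the data you want.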

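Things are much easier if you can safely read past the end of CMD_BLOCK without faulting. Hopefully you can arrange for that to be safe, e.g. by making sure it isn't at the very end of a page that is followed by an unmapped page. If not, you might need to do an aligned load, and then conditionally another aligned load if you didn't get the end of the data.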
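
As a rough illustration of the "is it safe to over-read" question, the check below tests whether a 16-byte load stays inside one page. It is only a sketch: the helper name is made up for this example, and it assumes 4 KiB pages (the smallest x86 page size), so a load that stays within the page of a readable first byte cannot touch an unmapped page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original answer): returns 1 if a
   16-byte load starting at p stays within a single 4 KiB page, so it
   cannot fault as long as the first byte is readable. */
int load16_stays_in_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & 4095u;   /* low 12 address bits */
    return offset_in_page <= 4096u - 16u;              /* last safe start: 4080 */
}
```

If the check fails, the aligned-load fallback described above is one option; copying the tail into a 16-byte scratch buffer is another.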

SSE2 pcmpeqb: find the first match, and zero the bytes at that position and higher

section .rodata

ALIGN 32              ; No cache-line splits when taking an unaligned 16B window on these 32 bytes
dd -1, -1, -1, -1
zeroing_mask:
dd  0,  0,  0,  0

ALIGN 16
end_pattern:  times 16   db 0x20    ; pre-broadcast the byte to compare against  (or generate it on the fly)

section .text

    ... as part of some function ...
    movdqu   xmm0, [CMD_BLOCK]       ; you don't have to waste instructions putting pointers in registers.
    movdqa   xmm1, [end_pattern]     ; or hoist this load out of a loop
    pcmpeqb  xmm1, xmm0

    pmovmskb eax, xmm1
    bsf      eax, eax                ; index of the first 0x20 = number of bytes of the vector to keep
    jz    @no_match                  ; bsf is weird when input is 0 :(
    neg      rax                     ; go back this far into the all-ones bytes
    movdqu   xmm1, [zeroing_mask + rax]   ; take a window of 16 bytes
    pand     xmm0, xmm1
@no_match:                          ; all bytes are valid, no masking needed
    ;; XMM0 holds bytes from [CMD_BLOCK], up to but not including the first 0x20.
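
For readers who prefer intrinsics, here is a minimal C sketch of the same sliding-window mask idea. The function and table names are invented for this example, `__builtin_ctz` is a GCC/Clang builtin used to find the lowest set bit (the first match), and the function always reads 16 bytes from `src`, so the same over-read caveat applies.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00, like the dd -1 / dd 0
   rows around zeroing_mask in the asm. */
static const uint8_t mask_window[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    /* the remaining 16 bytes are implicitly zero */
};

/* Hypothetical helper: load 16 bytes and zero everything from the
   first 0x20 byte onward. */
__m128i first_word_sse2(const void *src)
{
    __m128i data  = _mm_loadu_si128((const __m128i *)src);
    __m128i match = _mm_cmpeq_epi8(data, _mm_set1_epi8(0x20));
    unsigned bits = (unsigned)_mm_movemask_epi8(match);

    if (bits == 0)
        return data;                 /* no terminator: all 16 bytes are valid */

    unsigned keep = (unsigned)__builtin_ctz(bits);   /* index of first match */
    /* Window starting `keep` bytes before the zeros: keep x 0xFF, then 0x00. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_window + 16 - keep));
    return _mm_and_si128(data, mask);
}
```

Calling `first_word_sse2("open file.txt 12")` should leave `open` in the low four bytes and zeros above them.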

On Intel Haswell, this should have about 11c latency from the input to PCMPEQB being ready until the output of PAND is ready.

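If you can use TZCNT (BMI1) for the bit scan, you can avoid the branch. Since we want a 16 in the no-match case (so that neg rax gives -16 and we load a vector of all-ones), a 16-bit TZCNT does the trick. (tzcnt ax, ax works because the upper bytes of RAX are already zero from pmovmskb. Otherwise use xor ecx, ecx / tzcnt cx, ax.)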
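
The "count is 16 when nothing matches" trick can also be sketched in C. This version does not use a 16-bit count instruction; instead it sets bit 16 of the bitmap as a sentinel so that GCC/Clang's `__builtin_ctz` returns 16 in the no-match case. That substitution, and the function and table names, are assumptions made for this example.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* 16 bytes of 0xFF followed by 16 implicit zero bytes. */
static const uint8_t ones_then_zeros[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
};

/* Hypothetical helper: branch-free variant of the mask-window idea. */
__m128i first_word_branchless(const void *src)
{
    __m128i data  = _mm_loadu_si128((const __m128i *)src);
    __m128i match = _mm_cmpeq_epi8(data, _mm_set1_epi8(0x20));

    /* Bit 16 is a sentinel: with no match, the lowest set bit is bit 16. */
    unsigned bits = (unsigned)_mm_movemask_epi8(match) | 0x10000u;
    unsigned keep = (unsigned)__builtin_ctz(bits);   /* 0..16 */

    /* keep == 16 selects the all-ones window, so nothing is zeroed. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(ones_then_zeros + 16 - keep));
    return _mm_and_si128(data, mask);
}
```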

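This idea of generating a mask with an unaligned load, taking a window of some all-ones and some all-zero bytes, is the same as in one of my answers on "Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all".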

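There are alternatives to loading the mask from memory. For example, broadcast the first all-ones byte to all higher bytes of the vector, doubling the length of the masked region each time until it is big enough to cover the whole vector, even if the 0xFF byte was the very first byte.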

    movdqu   xmm0, [CMD_BLOCK]
    movdqa   xmm1, [end_pattern]
    pcmpeqb  xmm1, xmm0             ; 0 0 ... -1 ?? ?? ...

    movdqa   xmm2, xmm1
    pslldq   xmm2, 1
    por      xmm1, xmm2             ; 0 0 ... -1 -1 ?? ...

    movdqa   xmm2, xmm1
    pslldq   xmm2, 2
    por      xmm1, xmm2             ; 0 0 ... -1 -1 -1 -1 ?? ...

    pshufd   xmm2, xmm1, 0b10010000  ; [ a b c d ] -> [ a a b c ]
    por      xmm1, xmm2              ; 0 0 ... -1 -1 -1 -1 -1 -1 -1 -1 ?? ... (8-wide)

    pshufd   xmm2, xmm1, 0b01000000  ; [ abcd ] -> [ aaab ]
    por      xmm1, xmm2              ; 0 0 ... -1 (all the way to the end, no ?? elements left)
    ;; xmm1 = the same mask the other version loads with movdqu based on the index of the first match

    pandn    xmm1, xmm0              ; xmm1 = [CMD_BLOCK] with upper bytes zeroed


    ;; pshufd instead of copy + vector shift works:
    ;; [ abcd  efgh  ijkl  mnop ]
    ;; [ abcd  abcd  efgh  ijkl ]  ; we're ORing together so it's ok that the first 4B are still there instead of zeroed.
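
The same smear-upward sequence can be written with SSE2 intrinsics. This is a sketch with an invented function name; the shuffle constants 0x90 and 0x40 are the 0b10010000 and 0b01000000 immediates from the asm, and `_mm_andnot_si128(m, data)` computes `~m & data` like PANDN.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero every byte from the first 0x20 upward
   without loading a mask from memory. */
__m128i first_word_smear(const void *src)
{
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    __m128i m    = _mm_cmpeq_epi8(data, _mm_set1_epi8(0x20));

    m = _mm_or_si128(m, _mm_slli_si128(m, 1));        /* each match covers 2 bytes */
    m = _mm_or_si128(m, _mm_slli_si128(m, 2));        /* ... 4 bytes */
    m = _mm_or_si128(m, _mm_shuffle_epi32(m, 0x90));  /* [a b c d] -> [a a b c]: 8 bytes */
    m = _mm_or_si128(m, _mm_shuffle_epi32(m, 0x40));  /* [a b c d] -> [a a a b]: to the end */

    return _mm_andnot_si128(m, data);                 /* keep bytes below the first match */
}
```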

SSE4.2 PCMPISTRM:

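If you XOR with your terminator so that 0x20 bytes become 0x00 bytes, you might be able to use the SSE4.2 string instructions, since they're already set up to handle implicit-length strings where all bytes beyond a 0x00 are invalid. See this tutorial/example, because Intel's documentation just documents everything in full detail without focusing on the important parts first.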

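PCMPISTRM has 9 cycle latency on Skylake, 10c latency on Haswell, and 7c latency on Nehalem. So it's about break-even for latency on Haswell, or actually a loss, since we also need a PXOR. Looking for 0x00 bytes and marking the elements beyond that is hard-coded, so we need an XOR to turn 0x20 bytes into 0x00. But it's a lot fewer uops, and less code size.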

;; PCMPISTRM imm8:
;; imm8[1:0] = 00 = unsigned bytes
;; imm8[3:2] = 10 = equals each, vertical comparison.  (always not-equal since we're comparing the orig vector with one where we XORed the match byte)
;; imm8[5:4] = 11 = masked(-): inverted for valid bytes, but not for invalid  (TODO: get the logic on this and PAND vs. PANDN correct)
;; imm8[6] = 1 = output selection (byte mask, not bit mask)
;; imm8[7] = 0 (reserved.  Holy crap, this instruction has room to encode even more functionality??)

movdqu     xmm1, [CMD_BLOCK]

movdqa     xmm2, xmm1
pxor       xmm2, [end_pattern]       ; turn the stop-character into 0x00 so it looks like an implicit-length string
                                     ; also creating a vector where every byte is different from xmm1, so we get guaranteed results for the "valid" part of the vectors (unless the input string can contain 0x0 bytes)
pcmpistrm  xmm1, xmm2, 0b01111000    ; implicit destination operand: XMM0
pand       xmm0, xmm1

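I may not have the arguments to pcmpistrm exactly right, and I haven't had time to test it or verify it mentally. Suffice it to say, I'm pretty sure it's possible to get it to produce a mask that is all-ones before the first zero byte and all-zero from there on.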
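
One way to check the immediate is to try the same sequence through the `_mm_cmpistrm` intrinsic. The sketch below is an assumption-laden experiment rather than a verified implementation: the function name is invented, the `target("sse4.2")` attribute is GCC/Clang-specific, the CPU must support SSE4.2, and the input is assumed to contain no 0x00 bytes in its first 16 bytes. By my reading of Intel's description of "masked negative" polarity, valid bytes of the second operand get the inverted (all-ones) result and bytes from the terminator onward stay zero, which is what PAND needs.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <nmmintrin.h>   /* SSE4.2: _mm_cmpistrm and the _SIDD_* constants */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the asm: XOR turns 0x20 into 0x00, then
   PCMPISTRM builds a byte mask that PAND applies to the original data. */
__attribute__((target("sse4.2")))
__m128i first_word_pcmpistrm(const void *src)
{
    const __m128i data    = _mm_loadu_si128((const __m128i *)src);
    const __m128i flipped = _mm_xor_si128(data, _mm_set1_epi8(0x20));

    /* 0b01111000: unsigned bytes, equal-each, masked-negative, byte mask. */
    const __m128i mask = _mm_cmpistrm(data, flipped,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
        _SIDD_MASKED_NEGATIVE_POLARITY | _SIDD_UNIT_MASK);

    return _mm_and_si128(data, mask);
}
```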

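【Comments】:

Exactly what I was looking for! More, actually. Thank you so much for taking the time to write such a detailed answer; it has helped me a lot!
@poby: But wait, there's more :P I was updating the pcmpistrm section while you were commenting.
You can also avoid the branch with bsr by relying on its semi-documented property that it doesn't overwrite its output when the input is zero.
lzcnt might not be available; if for some reason you also can't use Harold's excellent trick, remember that you can avoid the branch (which is very important) by using a CMOV instruction.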