How to load 96 bits from memory into an XMM register? (Answers)

【Question title】: How to load 96 bits from memory into an XMM register?
【Posted】: 2016-08-02 11:38:36
【Question】:

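Suppose I have a pointer to memory in rsi, and I'd like to load the 12-byte value it points to into the low 96 bits of xmm0. I don't care what happens to the high 32 bits. What's an efficient way to do this?

(Side question: the best I've come up with involves the movlpd ("Move Low Packed Double-Precision Floating-Point Value") instruction. Is there any way in which that instruction is specific to floating-point values? I don't understand why it's documented that way; surely it should work for integers too.)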

【Question discussion】:

Tags: assembly intel sse sse2 sse4

【Solution 1】:

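If a 16-byte load won't cross into another page and fault, use movups. The high 4 bytes will be whatever garbage is in memory. Causing a cache miss on the 4 bytes you don't care about might be a problem, as could a cache-line split.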

Otherwise use movq / pinsrd (SSE4.1), or some other way of doing two loads + a shuffle. movq + pinsrd will be 3 fused-domain uops on Intel SnB-family CPUs, because pinsrd can't micro-fuse. (And its ALU uop needs the shuffle port (p5).)
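As a rough illustration of the "two loads + a shuffle" idea, here is a minimal C sketch using only SSE2 intrinsics. The helper name load96_sse2 is hypothetical, and which instructions the compiler actually emits is an assumption (typically movq, movd and punpcklqdq). With SSE4.1 available, the _mm_insert_epi32 intrinsic (pinsrd) could replace the last two steps.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads exactly 12 bytes from src and returns them in
   the low 96 bits of the result. The high 32 bits come out as zero. */
__m128i load96_sse2(const void *src)
{
    assert(src != NULL);
    const unsigned char *p = (const unsigned char *)src;

    int32_t hi;
    memcpy(&hi, p + 8, sizeof hi);                     /* 4-byte load of bytes 8..11 */

    __m128i lo = _mm_loadl_epi64((const __m128i *)p);  /* 8-byte load, upper half zeroed */
    __m128i up = _mm_cvtsi32_si128(hi);                /* dword in element 0, rest zero */

    /* Combine the low 64 bits of lo with the low 64 bits of up. */
    return _mm_unpacklo_epi64(lo, up);
}
```

Unlike a full 16-byte load, this never touches memory past the 12 requested bytes, at the cost of an extra load and a shuffle.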

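Another possibility: AVX VMASKMOVPS xmm1, xmm2, m128. From the instruction reference:

"Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element (MSBs of the first src operand)."

"... Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0."

Intel Haswell: 3 fused-domain uops (one load and two shuffles (p5)). 4c latency, one per 2c throughput.

It's probably not great by comparison, especially if the surrounding code has to shuffle.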

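Your conditional branch, which uses movups whenever it's guaranteed not to fault and only very rarely takes the slow path, is also 3 fused-domain uops on the fast path, and one of them can run on port 6 (not competing with the vector ALUs at all). The LEA isn't on the critical path either.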

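movlpd is safe to use on any data. It won't ever fault or be slow on data that represents a floating-point NaN or anything like that. You only need to worry about that with instructions that are listed in the insn ref manual with a non-empty "SIMD Floating-Point Exceptions" section. For example, addps can generate "Overflow, Underflow, Invalid, Precision, Denormal" exceptions, but shufps says "None".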

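【Discussion】:

Unfortunately I don't control the position or dimensions of the input, so I can't over-read. movq and pinsrd is what I came up with too; thanks for confirming.
Thanks also for the note about movlpd. But my question is: why is it documented as being specifically for floating-point values?
@jacobsa: If you know there's readable memory before the 12B you want, you can load from [addr-4] and then shift (psrldq). Or you could even mask the address to get a 16B-aligned pointer that covers some of the data you want (and still can't fault).
@jacobsa: IDK why there are so many different shuffles and things like movhlps for float and for integer. I've never seen any justification for the design decisions. I always assume Intel came up with some ideas that looked good at the time, combining MMX history with trying out other ideas when designing SSE. Of course, future CPU designs could have extra latency when forwarding the result of a movlpd load to an integer instruction. That's why they made movupd and movdqu when movups already existed. Some designs (Nehalem) care about reg-reg moves......
Thanks very much, that makes sense.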
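The [addr-4] suggestion from the discussion above can be sketched in C with SSE2 intrinsics. This is only an illustration under a loudly stated assumption: the 4 bytes immediately before the data must be readable. The helper name load96_from_before is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>

/* Hypothetical helper. ASSUMPTION: the 4 bytes just before src are readable,
   so the 16-byte load covers [src-4, src+12) and never reads past src+11. */
__m128i load96_from_before(const void *src)
{
    assert(src != NULL);
    const unsigned char *p = (const unsigned char *)src;

    /* One unaligned 16-byte load that starts 4 bytes early. */
    __m128i v = _mm_loadu_si128((const __m128i *)(p - 4));

    /* Byte shift right by 4 (psrldq): discards the 4 leading bytes and
       zero-fills the top 4 bytes of the register. */
    return _mm_srli_si128(v, 4);
}
```

This keeps a single load on every path, but it only applies when the caller can guarantee the readable prefix, which the asker said was not the case for their input.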

【Solution 2】:

Peter Cordes's answer got me thinking about pages, and in the end I just check whether we might possibly fault:

 // We'd like to perform only a single load from memory, but there's no 96-bit
 // load instruction and it's not necessarily safe to load the full 128 bits
 // since this may read beyond the end of the buffer.
 //
 // However, observe that memory protection applies with granularity of at
 // most 4 KiB (the smallest page size). If the full 16 bytes lies within a
 // single 4 KiB page, then we're fine. If the 12 bytes we are to read
 // straddles a page boundary, then we're also fine (because the next four
 // bytes must lie in the second page, which we're already reading). The only
 // time we're not guaranteed to be okay to read 16 bytes is if the 12 bytes
 // we want to read lie near the end of one page, and some or all of the
 // following four bytes lie within the next page.
 //
 // In other words, the only time there's a risk is when the pointer mod 4096
 // is in the range [4081, 4085). This is <0.1% of addresses. Check for this
 // and handle it specially.
 //
 // We perform the check by adding 15 and then checking for the range [0, 4).
 lea rax, [rsi+15]
 test eax, 0xffc
 jz slow_read

 // Hooray, we can load from memory just once.
 movdqu xmm0, XMMWORD PTR [rsi]

done_reading:
 [...]

slow_read:
 movq xmm1, QWORD PTR [rsi]
 pinsrd xmm1, DWORD PTR [rsi+8], 2
 jmp done_reading
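The arithmetic behind the lea / test / jz check can be cross-checked with a small C sketch. The function names are hypothetical, and the 4 KiB minimum page size is the same assumption the comments above make. Adding 15 maps the risky page offsets 4081..4084 onto 0..3 (mod 4096), which is exactly when bits 2..11 of the sum are all zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: 4 KiB is the smallest page size, as in the answer above. */
#define MIN_PAGE_SIZE 4096u

/* Mirrors the assembly: add 15, then test bits 2..11 (mask 0xffc).
   Returns true when the slow path (two smaller loads) should be taken. */
bool load16_may_fault(uintptr_t addr)
{
    return ((addr + 15u) & 0xffcu) == 0;
}

/* Direct statement of the condition: the 12 wanted bytes end inside one
   page, but a 16-byte load would spill into the next page. */
bool load16_may_fault_reference(uintptr_t addr)
{
    uintptr_t off = addr % MIN_PAGE_SIZE;
    return off >= 4081u && off < 4085u;
}
```

Only 4 of every 4096 starting offsets take the slow path, which matches the "<0.1% of addresses" estimate in the comments.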

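【Discussion】:

Faster: lea eax, [rsi+15] / test eax, 0xffc / jz. Since you're only interested in the low bytes, you don't need a 64-bit reg, which saves the byte for a REX prefix. I used eax because there's a special encoding for test eax, imm32. You could modify this to also use slow_read for cache-line splits, but you'd probably lose more to branch mispredicts than you'd gain. You should test this against using movq / pinsrd unconditionally to make sure it's actually better. lea / fused test-and-branch / movups is 3 uops, the same as movq / pinsrd, but with a shorter critical path.
Good old lea, how could I forget? Thanks. My benchmark isn't sensitive enough to tell the difference here (or this isn't the bottleneck), but your patch is better. Done.
lea EAX, [rsi+15] saves another byte in the encoding. The default address size is 64-bit, but the default operand size is 32-bit, even for LEA. Also, it's weird that movdqu loads into a different register than movq / pinsrd. (Also, you could save a byte by using movups. There's no downside to using FP loads/stores, and clang actually does this sometimes. However, do use the right insn type for reg-reg moves of the data, because some uarches care about that.)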

【Solution 3】:

    movss xmm0, [rdx+8]         //; +8*8Bits = 64 Bits
    pshufd xmm0, xmm0, 0x00     //; spreading it in every part
    movlps xmm0, [rdx]          //; overwriting the lower with 64 Bits

In my case it works with floats; I'm not sure whether it fits your case.

【Discussion】: