NASM 浮点 - 操作码和操作数的无效组合答案

【问题标题】：NASM floating point - invalid combination of opcode and operandsNASM 浮点 - 操作码和操作数的无效组合
【发布时间】：2019-02-01 21:47:34
【问题描述】：

我正在尝试从this article on x86 assembly floating point 编译以下代码示例（NASM 语法）：

;; c^2 = a^2 + b^2 - cos(C)*2*a*b
;; C is stored in ang

global _start

section .data
    a: dq 4.56   ;length of side a
    b: dq 7.89   ;length of side b
    ang: dq 1.5  ;opposite angle to side c (around 85.94 degrees)

section .bss
    c: resq 1    ;the result ‒ length of side c

section .text
    _start:

    fld qword [a]   ;load a into st0
    fmul st0, st0   ;st0 = a * a = a^2

    fld qword [b]   ;load b into st1
    fmul st1, st1   ;st1 = b * b = b^2

    fadd st1, st0   ;st1 = a^2 + b^2

    fld qword [ang] ;load angle into st0
    fcos            ;st0 = cos(ang)

    fmul qword [a]  ;st0 = cos(ang) * a
    fmul qword [b]  ;st0 = cos(ang) * a * b
    fadd st0, st0   ;st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)

    fsubp st1, st0  ;st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
                    ;and pop st0

    fsqrt           ;take square root of st0 = c

    fst qword [c]   ;store st0 in c ‒ and we're done!

当我执行以下命令时：

nasm -f elf32 cosineSample.s -o cosineSample.o

fmul st1, st1 行出现以下错误：

error: invalid combination of opcode and operands

我需要做什么来解决这个问题？我是否需要将特殊参数传递给nasm？代码示例错了吗？

【问题讨论】：

修改代码错误的wiki版本是en.wikibooks.org/w/index.php?title=X86_Assembly/…。我将对其进行编辑以修复它。
啊，真好。猜猜我在正确的地方问了这个问题=）

标签： assembly compiler-errors x86 nasm x87

【解决方案1】：

不幸的是，该代码已损坏。 fmul 不能对st1, st1 进行操作，但即使这样做，也不会按照作者的意愿进行操作。根据评论，他想计算b*b，但此时b 在st0 中。注释load b into st1 是错误的，fld 总是加载到st0（堆栈顶部）。您需要将fmul st1, st1 更改为fmul st0, st0。此外，为了得到正确的结果，下面的fadd st1, st0 也必须颠倒过来。该代码还会使 fpu 堆栈变脏。

还要注意程序没有结尾，所以它会出现段错误，除非你添加一个显式的exit 系统调用。

这里是固定代码，转换成gnu汇编语法：

.intel_syntax noprefix

.global _start

.data
    a: .double 4.56   # length of side a
    b: .double 7.89   # length of side b
    ang: .double 1.5  # opposite angle to side c (around 85.94 degrees)

.lcomm c, 8

.text
    _start:

    fld qword ptr [a]   # load a into st0
    fmul st             # st0 = a * a = a^2

    fld qword ptr [b]   # load b into st0
    fmul st             # st0 = b * b = b^2

    faddp               # st0 = a^2 + b^2

    fld qword ptr [ang] # load angle into st0
    fcos                # st0 = cos(ang)

    fmul qword ptr [a]  # st0 = cos(ang) * a
    fmul qword ptr [b]  # st0 = cos(ang) * a * b
    fadd st             # st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)

    fsubp               # st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
                        # and pop st0

    fsqrt               # take square root of st0 = c

    fstp qword ptr [c]  # store st0 in c - and we're done!

    # end program
    mov eax, 1
    xor ebx, ebx
    int 0x80

【讨论】：

通过“下面的fadd st1, st0也必须颠倒”，你的意思是应该是fadd st0, st1吗？
是的，没错。具有讽刺意味的是，它在较早的版本中是正确的，但被“修复”了。
啊，我明白了。谢谢！
引用，“[t]他的代码也让 fpu 堆栈变脏”。这是因为在原始代码中，fst 被调用而不是fstp，因此在堆栈中留下了一个值？另外，如果我们随后仍然使用fsqrt，那么fsubp 的意义何在 - 减法的结果被弹出到哪里？
是的，最终的fst 以及之前的一个额外临时项目仍在该版本的堆栈中。 fstp 不会弹出结果，它会弹出操作数之一。结果留在栈顶。

【解决方案2】：

我在 Wikibooks 上修复了代码并添加了一些额外的 cmets（Jester 的回答很好），所以现在它可以正确组装和运行（使用 GDB 测试，使用 layout ret / tui reg float 单步执行）。 This is the diff between revisions。引入了fmul st1,st1 无效指令错误is here 的修订版，但在此之前它在完成时未能清除x87 堆栈。

只是为了好玩，我想写一个更高效的版本，只加载一次 a 和 b。

这允许更多的指令级并行性，通过首先执行涉及cos 结果的所有。即在乘以cos(ang)之前准备2*a*b，这样这些计算就可以并行运行。假设fcos 是关键路径，我的版本只有一个fmul 和一个fsubp 从fcos 结果到fsqrt 输入的延迟。

default rel   ; in case we assemble this in 64-bit mode, use RIP-relative addressing

  ... declare stuff, omitted.

    fld    qword [a]   ;load a into st0
    fld    st0         ;   st1 = a  because we'll need it again later.
    fmul   st0, st0    ;st0 = a * a = a^2

    fld    qword [b]   ;load b into st0   (pushing the a^2 result up to st1)
    fmul   st2, st0    ;   st2 = a*b
    fmul   st0, st0    ;st0 = b^2,   st1 = a^2,  st2 = a*b

    faddp              ;st0 = a^2 + b^2   st1 = a*b;        st2 empty
    fxch   st1         ;st0 = a*b         st1 = a^2 + b^2    ;  could avoid this, but only by using cos(ang) earlier, worse for critical path latency
    fadd   st0,st0     ;st0 = 2*a*b       st1 = a^2 + b^2

    fld    qword [ang]
    fcos               ;st0 = cos(ang)       st1 = 2*a*b       st2 = a^2+b^2
    fmulp              ;st0=cos(ang)*2*a*b   st1 = a^2+b^2

    fsubp  st1, st0    ;st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
    fsqrt              ;take square root of st0 = c

    fstp   qword [c]   ;store st0 in c and pop, leaving the x87 stack empty again ‒ and we're done!

当然，x87 已经过时了。在现代 x86 上，通常您会使用 SSE2 标量（或打包！）来处理任何浮点数。

x87 在现代 x86 上有两件事要做：硬件精度为 80 位（与 64 位 double 相比），它适用于小代码大小（机器代码字节，而不是指令数量或源大小）。良好的指令缓存通常意味着代码大小并不是一个足够重要的因素，无法让 x87 对 FP 代码的性能产生影响，因为它通常比 SSE2 慢，因为需要额外的指令来处理笨重的 x87 堆栈。

对于初学者或代码大小的原因，x87 具有超越函数，如 fcos 和 fsin，并且 log/exp 内置为单个指令。它们使用许多微指令进行微编码，可能不比标量库函数快，但在某些 CPU 上，您可能对它们做出的速度/精度折衷以及绝对速度感到满意。至少如果您首先使用 x87，否则您必须通过存储/重新加载将结果反弹到 XMM 寄存器或从 XMM 寄存器反弹。

sin/cos 的范围缩小不会做任何扩展精度的事情来避免非常接近 Pi 倍数的巨大相对误差，只使用 Pi 的内部 80 位（64 位有效数字）值。（库实现可能会也可能不会这样做，具体取决于所需的速度与精度权衡。）请参阅Intel Underestimates Error Bounds by 1.3 quintillion。

（当然，32 位代码中的 x87 可以让您与 Pentium III 和其他没有 SSE2 用于双精度、只有 SSE1 用于浮点或根本没有 XMM 寄存器的 CPU 兼容。x86-64 将 SSE2 作为基线，所以这个优势在 x86-64 上不存在。）

对于初学者来说，x87 的巨大缺点是跟踪 x87 堆栈寄存器，而不是让东西堆积。你可以很容易地得到只运行一次的代码，但是当你把它放在一个循环中时会给出 NaN，因为你没有平衡你的 x87 堆栈操作。

extern cos
global cosine_law_sse2_scalar
cosine_law_sse2_scalar:
    movsd   xmm0, [ang]
    call    cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    movsd   xmm1, [a]
    movsd   xmm2, [b]

    movaps  xmm3, xmm1                ; copying registers should always copy the full reg, not movsd merging into the old value.
    mulsd   xmm3, xmm2   ; xmm3 = a*b

    mulsd   xmm1, xmm1   ; a^2
    mulsd   xmm2, xmm2   ; b^2

    addsd   xmm3, xmm3   ; 2*a*b

    addsd   xmm1, xmm2   ; a^2 + b^2
    mulsd   xmm3, xmm0   ; 2*a*b*cos(ang)
    subsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    sqrtsd  xmm0, xmm3   ; sqrt(that), in xmm0 as a return value
    ret
;; This has the work interleaved more than necessary for most CPUs to find the parallelism

这个版本在call cos 返回后只有 11 微秒。 (https://agner.org/optimize/)。它非常紧凑，非常简单。不跟踪 x87 堆栈。并且它具有与 x87 相同的良好依赖链，在我们已经拥有 2*a*b 之前不使用 cos 结果。

我们甚至可以将a 和b 作为一个128 位向量一起加载。但是然后将其拆包以对两半做不同的事情，或者从顶部元素获取b^2 作为标量，是很笨拙的。如果 SSE3 haddpd 只有 1 uop，那就太好了（如果输入正确，让我们用一条指令执行 a*b + a*b 和 a^2 + b^2），但在所有拥有它的 CPU 上都是 3 uop。

（PS 与 PD 仅对 MULSS/SD 等实际数学指令有影响。对于 FP 洗牌和寄存器副本，只需使用任何 FP 指令将您的数据放在您想要的位置，优先使用 PS/SS，因为它们的长度更短机器代码中的编码。这就是我使用movaps 的原因；movapd 总是错过优化浪费 1 个字节，除非您为了对齐而故意使指令更长。）

;; I didn't actually end up using SSE3 for movddup or haddpd, it turned out I couldn't save uops that way.
global cosine_law_sse3_less_shuffle
cosine_law_sse3_less_shuffle:
   ;; 10 uops after the call cos, if both extract_high_half operations use pshufd or let movhlps have a false dependency
   ;; or if we had AVX for  vunpckhpd  xmm3, xmm1,xmm1
   ;; and those 10 are a mix of shuffle and MUL/ADD.
    movsd   xmm0, [ang]
    call    cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    movups  xmm1, [a]     ; {a, b}  (they were in contiguous memory in this order.  low element = a)
    movaps  xmm3, xmm1

   ; xorps   xmm3, xmm3   ; break false dependency by zeroing.  (xorps+movhlps is maybe better than movaps + unpckhpd, at least on SnB but maybe not Bulldozer / Ryzen)
   ; movhlps xmm3, xmm1   ; xmm3 = b
;   pshufd  xmm3, xmm1, 0b01001110   ; xmm3 = {b, a}  ; bypass delay on Nehalem, but fine on most others

    mulsd   xmm3, [b]    ; xmm3 = a*b   ; reloading b is maybe cheaper than shufling it out of the high half of xmm1
    addsd   xmm3, xmm3   ; 2*b*a
    mulsd   xmm3, xmm0   ; 2*b*a*cos(ang)

    mulpd   xmm1, xmm1   ; {a^2, b^2}

    ;xorps  xmm2, xmm2   ; we don't want to just use xmm0 here; that would couple this dependency chain to the slow cos(ang) critical path sooner.
    movhlps xmm2, xmm1
    addsd   xmm1, xmm2   ; a^2 + b^2

    subsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    sqrtsd  xmm0, xmm1   ; sqrt(that), in xmm0 as a return value
    ret

我们可以使用 AVX 做得更好，保存 MOVAPS 寄存器副本，因为 3 操作数非破坏性 VEX 版本的指令允许我们将结果放入新寄存器，而不破坏任何一个输入。这对于 FP shuffle 非常有用，因为 SSE* 没有任何用于 FP 操作数的复制和洗牌，只有 pshufd 这可能会导致某些 CPU 上的额外旁路延迟。因此，它保存了 MOVAPS 和（注释掉的）XORPS，它们打破了对任何产生 MOVHLPS 的 XMM2 旧值的依赖。（MOVHLPS 将目标的低 64 位替换为 src 的高 64 位，因此它对两个寄存器都有输入依赖性）。

global cosine_law_avx
cosine_law_avx:
   ;; 9 uops after the call cos.  Reloading [b] is good here instead of shuffling it, saving total uops / instructions
    vmovsd   xmm0, [ang]
    call     cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    vmovups  xmm1, [a]     ; {a, b}  (they were in contiguous memory in this order.  low element = a)

    vmulsd   xmm3, xmm1, [b]  ; xmm3 = a*b

    vaddsd   xmm3, xmm3   ; 2*b*a.   (really vaddsd xmm3,xmm3,xmm3  but NASM lets us shorten when dst=src1)
    vmulsd   xmm3, xmm0   ; 2*b*a*cos(ang)

    vmulpd   xmm1, xmm1   ; {a^2, b^2}

    vunpckhpd xmm2, xmm1,xmm1  ; xmm2 = { b^2, b^2 }
    vaddsd   xmm1, xmm2   ; a^2 + b^2

    vsubsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    vsqrtsd  xmm0, xmm1,xmm1   ; sqrt(that), in xmm0 as a return value.  (Avoiding an output dependency on xmm0, even though it was an ancestor in the dep chain.  Maybe lets the CPU free that physical reg sooner)
    ret

我只测试了第一个 x87 版本，所以我可能错过了其他一个版本。

【讨论】：