Calling fsincos instruction in LLVM slower than calling libc sin/cos functions? Answers

【Question title】: Calling fsincos instruction in LLVM slower than calling libc sin/cos functions?
【Posted】: 2012-09-11 04:47:57
【Question】:

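I am working on a language that is compiled with LLVM. Just for fun, I wanted to do some microbenchmarks. In one of them, I run many millions of sin/cos computations in a loop. In pseudocode, it looks like this: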

var x: Double = 0.0
for (i <- 0 to 100 000 000)
  x = sin(x)^2 + cos(x)^2
return x.toInteger

If I compute sin/cos using LLVM IR inline assembly of the following form:

%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind

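this is faster than using fsin and fcos separately instead of fsincos. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile to calls to the C math library functions, at least with the target settings I am using (x86_64 with SSE enabled).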

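LLVM seems to insert some conversions between single- and double-precision FP, which might be the culprit. Why is that? Sorry, I am relatively new to assembly: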

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    xorps   %xmm0, %xmm0
    movl    $-1, %eax
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movss   %xmm0, -4(%rsp)
    flds    -4(%rsp)
    #APP
    fsincos
    #NO_APP
    fstpl   -16(%rsp)
    fstpl   -24(%rsp)
    movsd   -16(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    cvtsd2ss        %xmm0, %xmm1
    movsd   -24(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    cvtsd2ss        %xmm0, %xmm0
    addss   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %eax
    cmpl    $99999999, %eax         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttss2si       %xmm0, %eax
    ret
.Ltmp160:
    .size   main, .Ltmp160-main
    .cfi_endproc

The same test, calling the llvm sin/cos intrinsics:

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    pushq   %rbx
.Ltmp162:
    .cfi_def_cfa_offset 16
    subq    $16, %rsp
.Ltmp163:
    .cfi_def_cfa_offset 32
.Ltmp164:
    .cfi_offset %rbx, -16
    xorps   %xmm0, %xmm0
    movl    $-1, %ebx
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movsd   %xmm0, (%rsp)           # 8-byte Spill
    callq   cos
    mulsd   %xmm0, %xmm0
    movsd   %xmm0, 8(%rsp)          # 8-byte Spill
    movsd   (%rsp), %xmm0           # 8-byte Reload
    callq   sin
    mulsd   %xmm0, %xmm0
    addsd   8(%rsp), %xmm0          # 8-byte Folded Reload
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %ebx
    cmpl    $99999999, %ebx         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttsd2si       %xmm0, %eax
    addq    $16, %rsp
    popq    %rbx
    ret
.Ltmp165:
    .size   main, .Ltmp165-main
    .cfi_endproc

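Can you suggest what the ideal assembly would look like with fsincos? P.S. Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl etc.), but the speed remains the same.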

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    xorps   %xmm0, %xmm0
    movl    $-1, %eax
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movsd   %xmm0, -8(%rsp)
    fldl    -8(%rsp)
    #APP
    fsincos
    #NO_APP
    fstpl   -24(%rsp)
    fstpl   -16(%rsp)
    movsd   -24(%rsp), %xmm1
    mulsd   %xmm1, %xmm1
    movsd   -16(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %eax
    cmpl    $99999999, %eax         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttsd2si       %xmm0, %eax
    ret
.Ltmp160:
    .size   main, .Ltmp160-main
    .cfi_endproc

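【Comments】:

Hm.. I think I am beginning to get it. fsin/fcos/fsincos use the x87 registers, while mulsd and addsd use MMX / SSE. So the overhead probably comes from moving data between them?
No, cvtsd2ss is the conversion from double to float. But stay away from the legacy coprocessor instructions; they are slower and less accurate than library routines nowadays. See for example gcc.gnu.org/ml/gcc/2012-02/msg00188.html
Yes, the moves cause extra overhead, but not much compared to the 200-300 cycles that fsincos takes.
Thanks, I guess I will stick with the llvm sin/cos intrinsics.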

Tags: assembly llvm inline-assembly x87

【Answer 1】:

Hardware trig is slow.

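Too much documentation claims that x87 instructions like fsin or fsincos are the fastest way to compute trigonometric functions. Those claims are often wrong.

The fastest way depends on your CPU. As CPUs have become faster, old hardware trig instructions like fsin have not kept pace. On some CPUs, a software function that uses a polynomial approximation for sine or another trig function is now faster than the hardware instruction.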
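
To make "polynomial approximation" concrete, here is a minimal sketch of a software sine for small arguments. It assumes a plain Taylor polynomial evaluated with Horner's rule; real math libraries typically use carefully tuned coefficients and are not reproduced here. The name `poly_sin` is hypothetical.

```c
#include <math.h>

/* Degree-11 Taylor polynomial for sin(x), evaluated with Horner's rule.
   Illustrative only: intended for small |x| (roughly |x| <= pi/4), where
   the truncation error is on the order of 1e-11 or smaller.
   This is an assumption-laden sketch, not any libm's actual code. */
double poly_sin(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
              + x2 * (1.0 / 120.0
              + x2 * (-1.0 / 5040.0
              + x2 * (1.0 / 362880.0
              + x2 * (-1.0 / 39916800.0))))));
}
```

The whole evaluation is a handful of multiplications and additions, which map directly onto SSE instructions such as mulsd and addsd, so nothing has to move through the x87 register stack.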

In short, fsincos is too slow.

Hardware trig is obsolete.

There is enough evidence that the x86-64 platform has moved away from hardware trig.

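For floating point, amd64 prefers SSE over x87. However, SSE has no equivalent of x87 instructions like fsin.
For amd64, the libm in FreeBSD and in glibc implement sin() and similar functions in software, not with x87 trig. glibc has optimized x86-64 assembly for sinf() (single-precision sine) that uses a polynomial approximation, not x87 fsin. NetBSD and OpenBSD made the opposite choice: their amd64 libraries do use x87 instructions.
Steel Bank Common Lisp uses fsin in its x86 backend, but not in its x86-64 backend. For x86-64, SBCL compiles code that calls sin() in libm.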

Hardware trig loses the race.

I timed hardware and software sine on an AMD Phenom II X2 560 (3.3 GHz) from 2010. I wrote a C program with this loop:

volatile double a, s;
/* ... */
for (i = 0; i < 100000000; i++)
        s = sin(a);

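I compiled this program twice, with two different implementations of sin(). The hard sin() uses x87 fsin. The soft sin() uses a polynomial approximation. My C compiler, gcc -O2, did not replace my sin() call with an inline fsin.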
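
The answer does not show its hard sin() implementation. As a rough sketch of what such a function might look like, the following uses GCC-style inline assembly with the x87 register constraints `t` (top of the x87 stack) and `u` (second from top). The names `hard_sin` and `hard_sincos` are hypothetical, and the code assumes GCC or Clang on x86; on other targets it falls back to libm so that it still compiles.

```c
#include <math.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))

/* Sine via the x87 fsin instruction: the input is loaded into st(0)
   and the result is read back from st(0). */
double hard_sin(double x)
{
    double result;
    __asm__ ("fsin" : "=t" (result) : "0" (x));
    return result;
}

/* Sine and cosine via one fsincos: afterwards st(0) holds the cosine
   and st(1) holds the sine. */
void hard_sincos(double x, double *sin_out, double *cos_out)
{
    double s, c;
    __asm__ ("fsincos" : "=t" (c), "=u" (s) : "0" (x));
    *sin_out = s;
    *cos_out = c;
}

#else

/* Fallback for non-x86 targets (assumption: libm is acceptable there). */
double hard_sin(double x) { return sin(x); }

void hard_sincos(double x, double *sin_out, double *cos_out)
{
    *sin_out = sin(x);
    *cos_out = cos(x);
}

#endif
```

On x86-64 the argument arrives in an SSE register, so the compiler has to store it to memory and load it onto the x87 stack before the instruction, then do the reverse for the results. That is the same store/load traffic visible in the question's assembly listings.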

Here are the results for sin(0.5):

$ time race-hard 0.5
    0m3.40s real     0m3.40s user     0m0.00s system
$ time race-soft 0.5
    0m1.13s real     0m1.15s user     0m0.00s system

Here soft sin(0.5) is so fast that, on this CPU, soft sin(0.5) plus soft cos(0.5) would be faster than a single x87 fsin.

For sin(123):

$ time race-hard 123
    0m3.61s real     0m3.62s user     0m0.00s system
$ time race-soft 123
    0m3.08s real     0m3.07s user     0m0.01s system

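Soft sin(123) is slower than soft sin(0.5) because 123 is too large for the polynomial, so the function must first subtract some multiple of 2π. If I also want cos(123), there is a chance that x87 fsincos would be faster than soft sin(123) plus soft cos(123) on this CPU from 2010.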

【Comments】:

I can confirm: even on my old Intel Xeon E5420, one million fSinCos assembly instructions take 644 ms, versus 101 ms for System.Math.Sin + System.Math.Cos.