Assembly Language (x86): How to create a loop to calculate the Fibonacci sequence

【Question title】: Assembly Language (x86): How to create a loop to calculate Fibonacci sequence
【Posted】: 2015-12-16 01:22:34
【Question description】:

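I am programming assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence for n elements. In other words, I am trying to go to an array element, obtain the two elements before it, add them up, and store the result in another array.

I am having trouble setting up the index registers to make this work.

My program is set up like this: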

TITLE fibonacci.asm

INCLUDE Irvine32.inc

.data
    fibInitial  BYTE 0, 1, 2, 3, 4, 5, 6
    fibComputed BYTE 5 DUP(0)

.code
main PROC

    MOVZX si, fibInitial
    MOVZX di, fibComputed
    MOV   cl, LENGTHOF fibInitial

L1:
    MOV   ax, [si - 1]
    MOV   dx, [si - 2]
    MOV   bp, ax + dx
    MOV   dl, TYPE fibInitial
    MOVZX si, dl
    MOV   [edi], bp
    MOV   dh, TYPE fibComputed
    MOVZX di, dl
    loop L1

exit
main ENDP
END main

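I can't get it to compile because of an error message on the line MOV bp, ax + dx that says "error A2031: must be index or base register". However, I'm certain there are other logic errors I'm overlooking.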

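【Question discussion】:

MOV bp, ax + dx is not a valid x86 instruction. In 32-bit code you could use lea ebp, [eax + edx] (lea bp, [ax + dx] wouldn't work, because [ax + dx] is not a valid effective address). Note that ebp has a specific purpose in certain situations, so you might want to consider using a different register.
Also, your attempts to read from [si - 1] and [si - 2] are incorrect. si doesn't hold a valid address at that point.
@Michael How would I reference the 1 or 2 elements below the current element of an array in a loop (ignoring that fibInitial has no elements below 2 right now)?
I suggest you start by reading an x86 assembly tutorial, such as Art Of Assembly, since you seem to have misunderstood some of the basics.
Yeah, I was about to start writing an answer, but there are so many mistakes it would be huge. Make sure you keep track of when you're using mov reg, imm32 to put an address into a register, and when you're using mov reg, [ addr ] to load data from memory.

Tags: assembly x86 masm fibonacci irvine32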

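【Solution 1】:

Considering that fib(93) = 12200160415121876738 is the largest value that fits in a 64-bit unsigned integer, there may not be much point in trying to optimize this, unless you are calculating fib(n) modulo some (usually prime) number for large n.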
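The fib(93) bound can be checked with a short loop. This is only a sketch for illustration; the function name `largest_fib_index_u64` is made up here, and it assumes the convention fib(0) = 0, fib(1) = 1.

```c
#include <stdint.h>

/* Hypothetical helper: returns the largest n for which fib(n) still fits
   in a uint64_t, with fib(0) = 0 and fib(1) = 1. */
static unsigned largest_fib_index_u64(void) {
    uint64_t prev = 0, curr = 1;   /* fib(n-1) and fib(n), starting at n = 1 */
    unsigned n = 1;
    /* keep stepping while prev + curr would not wrap around */
    while (curr <= UINT64_MAX - prev) {
        uint64_t next = prev + curr;
        prev = curr;
        curr = next;
        n++;
    }
    return n;
}
```

The overflow test is done before the addition, so the loop never relies on wrapped values.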

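There is a way to calculate fib(n) directly in log₂(n) iterations, using either a Lucas sequence method or a matrix method for Fibonacci. The Lucas sequence method is faster and is shown below. Either could be modified to do the math modulo some number.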

#include <stdint.h>

/* lucas sequence method: returns fib(n), valid for 0 <= n <= 93 */
uint64_t fibl(int n) {
    uint64_t a, b, p, q, qq, aq;
    a = q = 1;
    b = p = 0;
    while(1){
        if(n & 1) {
            aq = a*q;
            a = b*q + aq + a*p;
            b = b*p + aq;
        }
        n >>= 1;
        if(n == 0)
            break;
        qq = q*q;
        q = 2*p*q + qq;
        p = p*p + qq;
    }
    return b;
}
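As a rough sketch of the "modulo some number" modification mentioned above, the same recurrence can be run with every product reduced mod m. The name `fib_mod` and the restriction 0 < m < 2^32 (so that each product of two reduced values fits in a uint64_t) are assumptions made for this example, not part of the answer.

```c
#include <stdint.h>

/* Hypothetical variant of fibl(): returns fib(n) mod m.
   Assumes 0 < m < 2^32 so that x * y cannot overflow for x, y < m. */
static uint64_t fib_mod(uint64_t n, uint64_t m) {
    uint64_t a = 1 % m, b = 0, p = 0, q = 1 % m;
    while (n) {
        if (n & 1) {
            uint64_t aq = a * q % m;
            uint64_t na = (b * q % m + aq + a * p % m) % m;
            b = (b * p % m + aq) % m;   /* uses the old a (via aq) and old b */
            a = na;
        }
        n >>= 1;
        uint64_t qq = q * q % m;
        uint64_t nq = (2 * (p * q % m) + qq) % m;
        p = (p * p % m + qq) % m;       /* uses the old p and old q (via qq) */
        q = nq;
    }
    return b;
}
```

For example, `fib_mod(10, 1000)` gives 55, since fib(10) = 55.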

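【Discussion】:

Interesting. I assumed there wasn't any fast way to calculate fib(n). For my answer, I spent a lot of time optimizing the setup / cleanup so it's as fast as possible for short calls. I think my vector version does quite well, especially if n is odd. Optimizing for low overhead with low n was interesting, and much harder than just optimizing the loop. (That part was interesting too, just to see what kind of results I could get for a computation that depends on the previous computation, even though fib(n) itself isn't interesting after it overflows... unless BigInt...)

【Solution 2】: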

.386
.model flat, stdcall
.stack 4096
ExitProcess proto, dwExitCode:dword

.data
    fib word 1, 1, 5 dup(?)         ; array sized for the number of Fibonacci terms wanted; the first two are given
.code
main proc
    mov esi, offset fib             ; ESI = address of the first element
    mov cx, (lengthof fib) - 2      ; loop count: one iteration per element after the two seeded ones

L1:                                 ; start of the loop
    mov ax, [esi]                   ; AX = fib[i]
    add ax, [esi + type fib]        ; AX += fib[i+1], which gives the next term in the series
    mov [esi + 2 * type fib], ax    ; store it as fib[i+2]
    add esi, type fib               ; advance to the next element
    loop L1                         ; repeat

    invoke ExitProcess, 0
main endp
end main

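【Discussion】:

An ideal answer should explain how it solves the asker's problem.
Okay, I will adjust as needed.
Usually that means some text outside the code block to give the big picture. Also, this would be more readable if you indented the comments to a consistent column, so it's easier to read just the instructions without getting a wall-of-text effect. (See the asm code blocks in my answer to this question for an example of formatting / style.)
In 32-bit code, loop uses ECX. Your code will break if the high bytes of ECX happen to be non-zero on entry to main, because you'll loop 64k times! Just use ECX, or better, don't use the slow loop instruction at all, and use cmp esi, fib + sizeof fib - 8 / jb L1 (i.e. do {} while(p < endp)). Also note that after a loop iteration, ax holds the latest Fib(n), so if you initialize AX ahead of the loop, you only have to reload the older value inside it.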
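To make the do {} while(p < endp) suggestion concrete, here is a minimal C sketch of the same loop shape. The function name `fib_fill16` is hypothetical, and it assumes the array has at least 3 elements with the first two already seeded, as in the answer's data section.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the suggested loop: fib[0] and fib[1] are seeded,
   len >= 3 is assumed. The newest term stays in a local (like AX), so only
   the older term is reloaded from memory on each iteration. */
static void fib_fill16(uint16_t *fib, size_t len) {
    uint16_t *p = fib;
    uint16_t *endp = fib + len - 2;    /* stop once p[2] would be out of bounds */
    uint16_t newest = p[1];
    do {
        uint16_t old = p[0];           /* reload only the older term */
        newest = (uint16_t)(old + newest);
        p[2] = newest;                 /* store the next term */
        p++;
    } while (p < endp);                /* unsigned pointer compare, like cmp / jb */
}
```

Comparing against an end pointer means no separate counter register is needed.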

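【Solution 3】:

Related: Code-golf printing the first 1000 digits of Fib(10**9): my x86 asm answer, using an extended-precision adc loop and converting binary to strings. The inner loop is optimized for speed, the other parts for size.

Computing a Fibonacci sequence only requires keeping two pieces of state: the current element and the previous one. I have no idea what you're trying to do with fibInitial, other than counting its length. This isn't perl, where you'd write for $n (0..5).

I know you're just learning asm, but I'm still going to talk about performance. There's not much reason to learn asm without knowing what's fast and what's not. If you don't need performance, let a compiler generate the asm for you from C source. Also see the other links at https://stackoverflow.com/tags/x86/info

Using registers for your state removes the problem of needing to look at a[-1] while calculating a[1]. You start with curr=1, prev=0, and begin by storing a[0] = curr. To produce the "modern" starting-with-zero sequence of Fibonacci numbers, start with curr=0, prev=1.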
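In C, the two-variable idea looks roughly like this. It is a sketch only: the function name `fib_two_state` and the `zero_based` flag are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical C model: only curr and prev are kept as state, so the loop
   never has to read back earlier array elements. */
static void fib_two_state(uint32_t *a, uint32_t count, int zero_based) {
    uint32_t curr = zero_based ? 0 : 1;
    uint32_t prev = zero_based ? 1 : 0;
    for (uint32_t i = 0; i < count; i++) {
        a[i] = curr;                   /* a[0] = curr on the first pass */
        uint32_t next = curr + prev;
        prev = curr;
        curr = next;
    }
}
```

With `zero_based` set the output starts 0 1 1 2 3, otherwise 1 1 2 3 5.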

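You're lucky I was thinking about an efficient loop for Fibonacci code recently, so I took the time to write up a complete function. See below for an unrolled version and a vectorized version (which saves on store instructions, and also makes 64-bit integers fast even when compiling for a 32-bit CPU):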

; fib.asm
;void fib(int32_t *dest, uint32_t count);
; not-unrolled version.  See below for a version which avoids all the mov instructions
global fib
fib:
    ; 64bit SysV register-call ABI:
    ; args: rdi: output buffer pointer.  esi: count  (and you can assume the upper32 are zeroed, so using rsi is safe)

    ;; locals:  rsi: endp
    ;; eax: current   edx: prev
    ;; ecx: tmp
    ;; all of these are caller-saved in the SysV ABI, like r8-r11
    ;; so we can use them without push/pop to save/restore them.
    ;; The Windows ABI is different.

    test   esi, esi       ; test a reg against itself instead of cmp esi, 0
    jz     .early_out     ; count == 0.  

    mov    eax, 1         ; current = 1
    xor    edx, edx       ; prev    = 0

    lea    rsi, [rdi + rsi * 4]  ; endp = &out[count];  // loop-end pointer
    ;; lea is very useful for combining add, shift, and non-destructive operation
    ;; this is equivalent to shl rsi, 2  /  add rsi, rdi

align 16
.loop:                    ; do {
    mov    [rdi], eax     ;   *buf = current
    add    rdi, 4         ;   buf++

    lea    ecx, [rax + rdx] ; tmp = curr+prev = next_cur
    mov    edx,  eax      ; prev = curr
    mov    eax,  ecx      ; curr=tmp
 ;; see below for an unrolled version that doesn't need any reg->reg mov instructions

    ; you might think this would be faster:
    ; add  edx, eax    ; but it isn't
    ; xchg eax, edx    ; This is as slow as 3 mov instructions, but we only needed 2 thanks to using lea

    cmp    rdi, rsi       ; } while(buf < endp);
    jb    .loop           ; jump if (rdi BELOW rsi).  unsigned compare
    ;; the LOOP instruction is very slow, avoid it

.early_out:
    ret

Another loop condition could be

    dec     esi         ; often you'd use ecx for counts, but we had it in esi
    jnz     .loop

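AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also macro-fuse dec/jnz (or signed less-than-zero / greater-than-zero conditions). dec/inc don't update the carry flag, so you can't use them with the unsigned above/below branches ja/jb. I think the idea is that you could do an adc (add with carry) in a loop, using inc/dec for the loop counter so as not to disturb the carry flag, but partial-flags slowdowns make this bad on modern CPUs.

lea ecx, [eax + edx] needs an extra byte (an address-size prefix), which is why I used a 32-bit destination and a 64-bit address. (Those are the default operand sizes for lea in 64-bit mode.) It has no direct impact on speed, only an indirect one through code size.

An alternate loop body could be: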

    mov  ecx, eax      ; tmp=curr.  This stays true after every iteration
.loop:

    mov  [rdi], ecx
    add  ecx, edx      ; tmp+=prev  ;; shorter encoding than lea
    mov  edx, eax      ; prev=curr
    mov  eax, ecx      ; curr=tmp

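Unrolling the loop to do more iterations per pass means less shuffling. Instead of mov instructions, you just keep track of which register holds which variable, i.e. you handle the assignments with a sort of register renaming.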

.loop:     ;; on entry:       ; curr:eax  prev:edx
    mov  [rdi], eax             ; store curr
    add  edx, eax             ; curr:edx  prev:eax
.oddentry:
    mov  [rdi + 4], edx         ; store curr
    add  eax, edx             ; curr:eax  prev:edx

    ;; we're back to our starting state, so we can loop
    add  rdi, 8
    cmp  rdi, rsi
    jb   .loop

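The problem with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version that used lea to produce curr + prev in a third register, because I failed to realize that you don't actually need a temporary. Thanks to rcgldr for catching that.)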
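A C model of the unroll-by-two idea with a cleanup step at the end might look like this. It is a sketch under assumptions: the name `fib_unroll2` and the variables `x` and `y` are made up, and `x`/`y` stand in for the two registers that swap the curr/prev roles.

```c
#include <stdint.h>

/* Hypothetical C model of unrolling by two: x and y swap roles instead of
   being copied, and one leftover element is handled after the loop. */
static void fib_unroll2(uint32_t *buf, uint32_t count) {
    uint32_t x = 1, y = 0;             /* on entry: curr in x, prev in y */
    uint32_t i = 0;
    while (count - i >= 2) {
        buf[i] = x;                    /* store curr */
        y += x;                        /* now curr is y, prev is x */
        buf[i + 1] = y;                /* store curr */
        x += y;                        /* back to curr in x, prev in y */
        i += 2;
    }
    if (i < count)
        buf[i] = x;                    /* cleanup for an odd count */
}
```

The cleanup costs one extra branch per call, which is the overhead the branchless setup in the full unrolled version avoids.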

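See below for a complete, working unrolled version that handles any count.

Test frontend (new in this version: a canary element to detect asm bugs that write past the end of the buffer).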

// fib-main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void fib(uint32_t *buf, uint32_t count);

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    uint32_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = 0xdeadbeefUL;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    fib(buf, count);
    for (uint32_t i = 0 ; i < count ; i++){
        printf("%u ", buf[i]);
    }
    putchar('\n');

    if (buf[count] != 0xdeadbeefUL) {
        printf("fib wrote past the end of buf: sentinel = %x\n", buf[count]);
    }
}

This code is fully working and tested (unless I missed copying a change from my local file back into the answer).

peter@tesla:~/src/SO$ yasm -f elf64 fib.asm && gcc -std=gnu11 -g -Og fib-main.c fib.o
peter@tesla:~/src/SO$ ./a.out 48
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 512559680

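Unrolled version

Thanks again to rcgldr for getting me thinking about how to handle odd vs. even counts in the loop setup, rather than with a cleanup iteration at the end.

I went for branchless setup code, which adds 4 * count%2 to the starting pointer. That can be zero, but adding zero is cheaper than branching to find out whether we should. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient matters, not just the code inside the loop. (If we're optimizing at all, we want to optimize for many calls with short lengths.)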

    ; 64bit SysV register-call ABI
    ; args: rdi: output buffer pointer.  rsi: count

    ;; locals:  rsi: endp
    ;; eax: current   edx: prev
    ;; ecx: tmp
    ;; all of these are caller-saved in the SysV ABI, like r8-r11
    ;; so we can use them without push/pop to save/restore them.
    ;; The Windows ABI is different.

;void fib(int32_t *dest, uint32_t count);  // unrolled version
global fib
fib:
    cmp    esi, 1
    jb     .early_out       ; count below 1  (i.e. count==0, since it's unsigned)

    mov    eax, 1           ; current = 1
    mov    [rdi], eax
    je     .early_out       ; count == 1, flags still set from cmp
    ;; need this 2nd early-out because the loop always does 2 iterations

;;; branchless handling of odd counts:
;;;   always do buf[0]=1, then start the loop from 0 or 1
;;; Writing to an address you just wrote to is very cheap
;;; mov/lea is about as cheap as best-case for branching (correctly-predicted test/jcc for count%2==0)
;;; and saves probably one unconditional jump that would be needed either in the odd or even branch

    mov    edx, esi         ;; we could save this mov by using esi for prev, and loading the end pointer into a different reg
    and    edx, eax         ; prev = count & 1 = count%2

    lea    rsi, [rdi + rsi*4] ; end pointer: same regardless of starting at 0 or 1

    lea    rdi, [rdi + rdx*4] ; buf += count%2
    ;; even count: loop starts at buf[0], with curr=1, prev=0
    ;; odd  count: loop starts at buf[1], with curr=1, prev=1

align 16  ;; the rest of this func is just *slightly* longer than 16B, so there's a lot of padding.  Tempting to omit this alignment for CPUs with a loop buffer.
.loop:                      ;; do {
    mov    [rdi], eax       ;;   *buf = current
             ; on loop entry: curr:eax  prev:edx
    add   edx, eax          ; curr:edx  prev:eax

;.oddentry: ; unused, we used a branchless sequence to handle odd counts
    mov   [rdi+4], edx
    add   eax, edx          ; curr:eax  prev:edx
                            ;; back to our starting arrangement
    add    rdi, 8           ;;   buf += 2
    cmp    rdi, rsi         ;; } while(buf < endp);
    jb    .loop

;   dec   esi   ;  set up for this version with sub esi, edx; instead of lea
;   jnz   .loop
.early_out:
    ret
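For readers who find the register bookkeeping hard to follow, here is a rough C model of the branchless odd-count setup used in the asm above. The function name `fib_unrolled_model` is hypothetical; the logic is meant to mirror the asm, not replace it.

```c
#include <stdint.h>

/* Hypothetical C model of the unrolled asm: buf[0] = 1 is always stored,
   then the loop starts at buf[0] (even count) or buf[1] (odd count). */
static void fib_unrolled_model(uint32_t *buf, uint32_t count) {
    if (count < 1)
        return;
    uint32_t curr = 1;
    buf[0] = curr;
    if (count == 1)
        return;                        /* the loop always writes two elements */

    uint32_t prev = count & 1;         /* odd count: curr=1, prev=1 */
    uint32_t *endp = buf + count;      /* same end pointer either way */
    buf += count & 1;                  /* odd count: skip the element just stored */

    do {
        buf[0] = curr;
        prev += curr;                  /* roles swap: prev now holds curr */
        buf[1] = prev;
        curr += prev;                  /* back to the starting arrangement */
        buf += 2;
    } while (buf < endp);
}
```

For an even count the first store in the loop rewrites buf[0] with the same value, which is the cheap redundant store the asm comments mention.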

To produce the starting-with-zero sequence, do

curr=count&1;   // and esi, 1
buf += curr;    // lea rdi, [rdi + rsi*4]
prev= 1 ^ curr; // xor eax, esi

instead of the current

curr = 1;
prev = count & 1;
buf += count & 1;

We can also save a mov instruction in both versions by using esi to hold prev, now that prev depends on count.

  ;; loop prologue for sequence starting with 1 1 2 3
  ;; (using different regs and optimized for size by using fewer immediates)
    mov    eax, 1               ; current = 1
    cmp    esi, eax
    jb     .early_out           ; count below 1
    mov    [rdi], eax
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    and    esi, eax             ; prev = count & 1
    lea    rdi, [rdi + rsi*4]   ; buf += count & 1
  ;; eax:curr esi:prev    rdx:endp  rdi:buf
  ;; end of old code

  ;; loop prologue for sequence starting with 0 1 1 2
    cmp    esi, 1
    jb     .early_out           ; count below 1, no stores
    mov    dword [rdi], 0       ; store first element (operand size must be explicit)
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    mov    eax, 1               ; prev = 1
    and    esi, eax             ; curr = count&1
    lea    rdi, [rdi + rsi*4]   ; buf += count&1
    xor    eax, esi             ; prev = 1^curr
    ;; ESI:curr EAX:prev  (opposite of other setup)
  ;;

  ;; optimized for code size, NOT speed.  Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
  ;; most of the savings are from avoiding mov reg, imm32,
  ;; and from counting down the loop counter, instead of checking an end-pointer.
  ;; loop prologue for sequence starting with 0 1 1 2
    xor    edx, edx
    cmp    esi, 1
    jb     .early_out         ; count below 1, no stores
    mov    [rdi], edx         ; store first element
    je     .early_out         ; count == 1, flags still set from cmp

    xor    eax, eax  ; movzx after setcc would be faster, but one more byte
    shr    esi, 1             ; two counts per iteration, divide by two
  ;; shift sets CF = the last bit shifted out
    setc   al                 ; curr =   count&1
    setnc  dl                 ; prev = !(count&1)

    lea    rdi, [rdi + rax*4] ; buf+= count&1

  ;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
  ;; EAX:curr EDX:prev  (same as 1 1 2 setup)
  ;; even count: loop starts at buf[0], with curr=0, prev=1
  ;; odd  count: loop starts at buf[1], with curr=1, prev=0

  .loop:
       ...
    dec  esi                  ; 1B smaller than 64b cmp, needs count/2 in esi
    jnz .loop
  .early_out:
    ret
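The size-optimized zero-based prologue plus count-down loop can be modeled in C roughly as follows. This is a sketch with a made-up name, `fib0_countdown_model`; the loop body elided as "..." in the asm is assumed here to be the same two-store body as in the unrolled version.

```c
#include <stdint.h>

/* Hypothetical C model of the zero-based, count-down variant.
   pairs = count / 2 plays the role of ESI after shr esi, 1. */
static void fib0_countdown_model(uint32_t *buf, uint32_t count) {
    if (count < 1)
        return;
    buf[0] = 0;                        /* store first element */
    if (count == 1)
        return;

    uint32_t odd = count & 1;          /* the bit shifted out into CF */
    uint32_t pairs = count >> 1;       /* two elements per iteration */
    uint32_t curr = odd;               /* setc:  curr =  count&1  */
    uint32_t prev = !odd;              /* setnc: prev = !(count&1) */
    buf += odd;                        /* odd count: start at buf[1] */

    do {
        buf[0] = curr;
        prev += curr;
        buf[1] = prev;
        curr += prev;
        buf += 2;
    } while (--pairs != 0);            /* dec esi / jnz .loop */
}
```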

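Vectorized:

The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is reduce the number of stores to memory. Start with: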

a = [f3 f2 f1 f0 ]   -> store this to buf
b = [f2 f1 f0 f-1]

Then a+=b; b+=a; a+=b; b+=a; produces:

a = [f7 f6 f5 f4 ]   -> store this to buf
b = [f6 f5 f4 f3 ]

This is less silly when working with two 64-bit integers packed into a 128b vector. Even in 32-bit code, you can use SSE to do 64-bit integer math.
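A scalar C model of the two-lane 64-bit scheme may help when reading the SSE code. It is only a sketch: the `vec2` struct, `vadd`, and `fib64_pairs_model` are hypothetical names that stand in for an xmm register, paddq, and the vectorized function, and no real SIMD is used here.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register holding two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } vec2;

/* Lane-wise add, modeling paddq. */
static vec2 vadd(vec2 x, vec2 y) {
    vec2 r = { x.lo + y.lo, x.hi + y.hi };
    return r;
}

/* Stores the zero-based sequence 0 1 1 2 3 ... two elements at a time. */
static void fib64_pairs_model(uint64_t *dest, uint32_t count) {
    vec2 b = { 1, 0 };                 /* [f0 f-1]: lo = f-1 = 1, hi = f0 = 0 */
    vec2 a = { 0, 1 };                 /* [f1 f0 ]: lo = f0  = 0, hi = f1 = 1 */
    uint32_t i = 0;
    while (count - i >= 2) {
        dest[i] = a.lo;                /* one 128-bit store in the asm */
        dest[i + 1] = a.hi;
        i += 2;
        b = vadd(b, a);                /* b advances two steps past its old value */
        a = vadd(a, b);                /* a advances two steps as well */
    }
    if (i < count)
        dest[i] = b.hi;                /* odd count: store the high lane of b */
}
```

Each vector add advances both lanes by two positions in the sequence, so one store covers two elements.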

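A previous version of this answer had an unfinished packed 32-bit vector version that didn't properly handle count%4 != 0. To load the first 4 values of the sequence, I used pmovzxbd, so I didn't need 16B of data when 4B would do. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there's only one non-zero value to load and shuffle around.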

;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
    mov eax, 1
    movd    xmm1, eax               ; xmm1 = [0 1] = [f0 f-1]
    pshufd  xmm0, xmm1, 11001111b   ; xmm0 = [1 0] = [f1 f0]

    sub esi, 2
    jae .entry  ; make the common case faster with fewer branches
    ;; could put the handling for count==0 and count==1 right here, with its own ret

    jmp .cleanup
align 16
.loop:                          ; do {
    paddq   xmm0, xmm1          ; xmm0 = [ f3 f2 ]
.entry:
    ;; xmm1: [ f0 f-1 ]         ; on initial entry, count already decremented by 2
    ;; xmm0: [ f1 f0  ]
    paddq   xmm1, xmm0          ; xmm1 = [ f4 f3 ]  (or [ f2 f1 ] on first iter)
    movdqu  [rdi], xmm0         ; store 2nd last compute result, ready for cleanup of odd count
        add     rdi, 16         ;   buf += 2
    sub esi, 2
        jae   .loop             ; } while((count-=2) >= 0);
    .cleanup:
    ;; esi <= 0 : -2 on the count=0 special case, otherwise -1 or 0

    ;; xmm1: [ f_rc   f_rc-1 ]  ; rc = count Rounded down to even: count & ~1
    ;; xmm0: [ f_rc+1 f_rc   ]  ; f(rc+1) is the value we need to store if count was odd
    cmp esi, -1
    jne   .out  ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
    ;; xmm1 = [f1 f0]
    movhps  [rdi], xmm1         ; store the high 64b of xmm1.  There is no integer version of this insn, but that doesn't matter
    .out:
        ret

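There's no point unrolling this further: the dep chain latency limits throughput, so we can always average one stored element per cycle. Reducing the loop overhead in uops can help with hyperthreading, but that's pretty minor.

As you can see, handling all the corner cases is quite hard to keep track of, even when unrolling by only two. It takes extra startup overhead, even when you try to optimize that down to a minimum. It's easy to end up with a lot of conditional branches.

Updated main: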

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif

#define xstr(s) str(s)
#define str(s) #s

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    int benchmark = argc > 2;

    buftype_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = CANARY;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    if (benchmark) {
    int64_t reps = 1000000000 / count;
    for (int i=0 ; i<=reps ; i++)
        FIB_FN(buf, count);

    } else {
    FIB_FN(buf, count);
    for (uint32_t i = 0 ; i < count ; i++){
        printf("%" FMTu " ", buf[i]);
    }
    putchar('\n');
    }
    if (buf[count] != CANARY) {
        printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "\n", buf[count]);
    }
}

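Performance

For counts just below 8192, the unrolled-by-two non-vector version runs near its theoretical maximum throughput of 1 store per cycle (3.5 instructions per cycle) on my Sandybridge i5. 8192 * 4B/int = 32768 = L1 cache size. In practice, I see ~3.3 to ~3.4 insn / cycle. I'm counting the entire program with Linux perf, though, not just the tight loop.

Anyway, there's really no point unrolling further. And obviously this stopped being a Fibonacci sequence after count=47, since we use uint32_t. For large counts, however, the throughput is limited by memory bandwidth, down to ~2.6 insn / cycle. At that point we're basically looking at how to optimize memset.

The 64-bit vector version runs at 3 insns per cycle (one 128b store per two clocks) up to an array size of about 1.5 times the L2 cache size (i.e. ./fib64 49152). As the array size grows to larger fractions of the L3 cache size, performance drops to ~2 insn per cycle (one store per 3 clocks) at 3/4 of the L3 cache size. At sizes > L3 cache, it levels out at 1 store per 6 cycles.

So storing with vectors does better when we fit in L2 cache but not in L1.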

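【Discussion】:

You could unroll the loop to two iterations, alternating between ecx and edx in your example, since there's no need to keep a value in eax: | add ecx,edx | ... | add edx,ecx | .
@rcgldr: Thanks! IDK how I didn't see that, and got hung up on using a third piece of storage. (See my unrolled-by-3 version in the previous revision.) I was looking at a non-unrolled C version that used a temp, and somehow failed to see that prev becomes unneeded in the same step that produces the new curr. Updated my answer to simplify the unroll.
You could handle the odd-count case up front by changing the initial values used for ecx and edx, then branching into the middle of the loop. To initialize: | mov edx,count | mov eax,1 | and edx,eax | sub eax,edx | (or reverse eax / edx, depending on the loop).
@rcgldr: branches are for little kids :P Another great suggestion, though. Updated with a branchless version (if you don't count the extra jcc near the start, which special-cases count==1 as well as count==0, but those will be predicted perfectly unless someone actually calls it with such counts; the second branch comes after the movs :) It should be fine even on CPUs that don't like to see multiple branches within a group of 4 insns. (We know that decoding will start at the fn entry point.)
@rcgldr: en.wikipedia.org/wiki/Fibonacci_number says either way is valid. I think I can get the code to start at 0 by using prev=1; curr=0;. For odd counts, where we don't overwrite buf[0], prev=0; curr=1; So, curr=count&1; buf+=curr; prev=1 ^ curr;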