Cube root on x87 FPU using Newton-Raphson method: answers

【Question title】: Cube root on x87 FPU using Newton-Raphson method
【Posted】: 2016-08-23 18:21:57
【Question description】:

I am trying to write an assembly program for the 8086 processor that finds the cube root of a number. Obviously I am using floating point.

The algorithm is based on the Newton-Raphson method:

root := 1.0; 
repeat
     oldRoot := root;
     root := (2.0*root + x/(root*root)) / 3.0 
until ( |root - oldRoot| < 0.001 );

How do I divide (2*root + x) by (root*root)?

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
inttwo  DWORD   2
itThree DWORD   3
three   REAL4   3.0
x       DWORD   27


.CODE
main    PROC
        finit           ; initialize FPU
        fld     root    ; root in ST
        fmul    two     ; root*two
        fadd    x       ; root*two+27

        fld     root    ; root in ST
        fimul    root    ; root*root

        mov     eax, 0  ; exit
        ret
main    ENDP 
END

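I think I don't understand what ends up where on the stack. Does the product from the line

fimul   root    ; root*root

go into ST(1)? EDIT: No, it goes into st(0); what was in st(0) got pushed down the stack to st(1).

But I still haven't figured out the answer to my question... How do I divide? Now I see that I need to divide st(1) by st(0), but I don't know how. I tried this: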

finit           ; initialize FPU
fld     root    ; root in ST
fmul    two     ; root*two
fadd    xx      ; root*two+27
; the answer to root*two+x is stored in ST(0) when we load root st(0) moves to ST1 and we will use ST0 for the next operation

fld     root    ; root in ST previous content is now in ST1
fimul   root    ; root*root
fidiv   st(1)

EDIT: I wrote the formula wrong. This is what I'm looking for.

((2.0*root) + x / (root*root)) / 3.0 That's what I need. 
STEP 1) (2 * root) 
STEP 2) x / (root * root) 
STEP 3) ADD step one and step 2 
STEP 4) divide step 3 by 3.0

root = (2.0*1.0) + 27/(1.0*1.0) / 3.0 ; (2) + 27/(1.0) / 3.0 = 11 ==> root = 11

EDIT2: New code!!

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
three   REAL4   3.0
xx      REAL4   27.0


.CODE
main    PROC
        finit           ; initialize FPU
                fld     root    ; root in ST    ; Pass 1 ST(0) has 1.0  
repreatAgain:
        ;fld    st(2)

        fmul    two     ; root*two      ; Pass 1 ST(0) has 2                                                                            Pass 2 ST(0) = 19.333333 st(1) = 3.0 st(2) = 29.0 st(3) = 1.0

        ; the answer to roor*two is stored in ST0 when we load rootSTO moves to ST1 and we will use ST0 for the next operation
        fld     root    ; root in ST(0) previous content is now in ST(1)      Pass 1 ST(0) has 1.0 ST(1) has 2.0                        Pass 2 st(
        fmul    st(0), st(0)    ; root*root                                 ; Pass 1 st(0) has 1.0 st(1) has 2.0
        fld     xx                                                          ; Pass 1 st(0) has 27.0 st(1) has 1.0 st(2) has 2.0
        fdiv    st(0), st(1) ; x / (root*root)  ; Pass 1: 27 / 1              Pass 1 st(0) has 27.0 st(1) has 2.0 st(2) has 2.0
        fadd    st(0), st(2) ; (2.0*root) + x / (root*root))                  Pass 1 st(0) has 29.0 st(1) has 1.0 st(2) has 2.0

        fld     three                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) has 1.0 st(3) has 2.0

        fld     st(1)                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) = 1.0 st(3) = 2.0
        fdiv    st(0), st(1) ; (2.0*root) + x / (root*root)) / 3.0            Pass 1 st(1) has 9.6666666666667



        jmp     repreatAgain
        mov     eax, 0  ; exit
        ret
main    ENDP 
END

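【Question comments】:

Before I even look at the code, I'm a bit confused by the question. Part of your calculation is root := (2.0*root + x/(root*root)) / 3.0. By operator precedence, that is the same as root := ((2.0*root) + (x/(root*root))) / 3.0. So my question is this: are you sure you want to divide (2*root + x) by (root*root)? Either the question is wrong, or the equation you are actually trying to solve does not match the question.
In addition to Michael's comments, you may be looking for the fdivp instruction.
((2.0*root) + x / (root*root)) / 3.0 is what I need. So you are right. STEP 1) (2 * root) STEP 2) x / (root * root) STEP 3) add step 1 and step 2 STEP 4) divide step 3 by 3.0
@MichaelPetch: I got the loop body down to only two insns (fld st(0) and fxch) of overhead beyond what the calculation actually needs. I think I'm genetically incapable of looking at anything without wanting to optimize it, and it seemed a shame not to post the result.
I may be reading this wrong, but you rely on the memory location root holding the value from the previous pass, yet I don't see you writing the new root value back to that memory variable at the bottom of the loop.

Tags: assembly x86 masm newtons-method x87

【Solution 1】:

Intel's insn reference manual documents all the instructions, including fdiv and fdivr (x/y instead of y/x). If you really need to learn mostly-obsolete x87 (fdiv) instead of SSE2 (divss), then this x87 tutorial is essential reading, especially the early chapters that explain the register stack. See also this x87 FP comparison Q&A, and more links in the x86 tag wiki.

re: the EDIT2 code dump: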

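There are 4 fld instructions inside the loop, but no p-suffixed operations. Your loop will overflow the 8-register FP stack by the 3rd iteration, at which point you'll get a NaN. (Specifically, the indefinite-value NaN, which printf prints as 1#IND.)

I'd suggest designing your loop so that an iteration starts with root in st(0) and ends with the next iteration's root value in st(0). Don't load from or store to root inside the loop. Use fld1 to load 1.0 as the initial value outside the loop, and fstp [root] after the loop to pop st(0) into memory.

You picked the least convenient way to do tmp / 3.0: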

                          ; stack = tmp   (and should otherwise be empty once you fix the rest of your code)
    fld     three         ; stack = 3.0, tmp
    fld     st(1)         ; stack = tmp, 3.0, tmp   ; should have used fxch to just swap instead of making the stack deeper
    fdiv    st(0), st(1)  ; stack = tmp/3.0, 3.0, tmp

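fdiv, fsub, and so on have multiple register-register forms: one where st(0) is the destination, and one where it is the source. The form with st(0) as the source is also available with a pop, so you can do: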

    fld     three         ; stack = 3.0, tmp
    fdivp                 ; stack = tmp / 3.0  popping the stack back to just one entry
    ; fdivp  st(1), st(0) ; this is what fdivp with no operands means

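It's actually even simpler than that if you use a memory operand directly instead of loading it. Since you want st(0) /= 3.0, you can do fdiv [three]. In this respect FP ops are just like integer ops, where you can do div dword ptr [integer_from_memory] to use a memory source operand.

The non-commutative operations (subtract and divide) also have reverse versions (e.g. fdivr), which can save you an fxch or let you use a memory operand even when you need 3.0/tmp instead of tmp/3.0.

Dividing by 3 is the same as multiplying by 1/3, and fmul is much faster than fdiv. From a code-simplicity standpoint, multiplication is commutative, so another way to implement st(0) /= 3 is: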

fld    [one_third]
fmulp                  ; shorthand for  fmulp st(1), st(0)

; or
fmul   [one_third]

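Note that 1/3.0 has no exact representation in binary floating point, whereas every integer between about +/- 2^24 does (the significand of a single-precision REAL4 holds 24 bits, counting the implicit leading bit). You should only care about this if you expect to be working with exact integer multiples of 3.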
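To make the precision remark concrete, here is a small C sketch (C rather than MASM so it is easy to run; the function names are made up for this illustration). It assumes `float` is IEEE-754 binary32, the same format as REAL4, and uses `volatile` only to force values to be rounded to 32 bits.

```c
/* Returns 1 if the integer n survives a round trip through a 32-bit float.
   Assumption: float is IEEE-754 binary32 (same format as MASM REAL4). */
int survives_float_round_trip(long n) {
    volatile float f = (float)n;   /* volatile forces rounding to 32 bits */
    return (long)f == n;
}

/* Signed error of the float nearest to 1/3, measured in double precision. */
double one_third_error(void) {
    volatile float third = 1.0f / 3.0f;
    return (double)third - 1.0 / 3.0;
}
```

With these helpers, 16777216 (2^24) round-trips exactly while 16777217 does not, and `one_third_error()` is small but nonzero, which is why multiplying by a stored one_third constant is not bit-identical to dividing by 3.0 in every case.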

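Comments on the original code:

You can hoist one division out of the loop by computing 2.0 / 3.0 and x/3.0 ahead of time. This is worth it if you expect the loop to run more than one iteration on average.

You can duplicate the top of the stack with fld st(0), so you don't have to keep loading from memory.

fimul [root] (integer mul) is a bug: your root is in REAL4 (32-bit float) format, not integer. fidiv is likewise a bug, and of course it doesn't work with an x87 register as a source operand.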
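To see why the integer form is wrong here, this C sketch reinterprets the bits of a float as a 32-bit integer, which is roughly what an integer-operand instruction such as fimul does with a REAL4 memory operand. The helper name is hypothetical, and the sketch assumes `float` is IEEE-754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32-bit pattern of a float as a signed integer.
   Assumption: float is IEEE-754 binary32, like MASM's REAL4. */
int32_t float_bits_as_int(float value) {
    int32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* well-defined type punning in C */
    return bits;
}
```

The bit pattern of 1.0 is 0x3F800000, so an instruction that reads root as an integer would multiply by 1065353216 rather than by 1.0.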

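Since you have root at the top of the stack, I think you can just use fmul st(0) to use st(0) as both the explicit and implicit operand, giving st(0) = st(0) * st(0) with no change in stack depth.

You could also use sqrt as a better initial approximation than 1.0, or +/-1 * sqrtf(fabsf(x)). I don't see an x87 instruction for applying the sign of one float to another, just fchs to flip unconditionally and fabs to clear the sign bit unconditionally. There is fcmov, but it requires a P6 or later CPU. You mentioned 8086 but then used .586, so IDK what you're targeting.

A better loop body:

Not debugged or tested, but your code full of repeated loads of the same data was driving me crazy. This optimized version is here because I was curious, not because I think it will directly help the OP.

Also, hopefully this is a good example of how to comment the data flow in code where that is tricky (e.g. x87, or vectorized code with shuffles).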

; 2.0/3.0 in st(1)
; x/3.0 in st(2)

; before each iteration: st(0) = root
;  after each iteration: st(0) = root * 2.0/3.0 + (x/3.0 / (root*root)), with constants undisturbed

loop_body:
    fld     st(0)         ; stack: root, root, 2/3, x/3
    fmul    st(0), st(0)  ; stack: root^2, root, 2/3, x/3
    fdivr   st(0), st(3)  ; stack: x/3 / root^2, root, 2/3, x/3
    fxch    st(1)         ; stack: root, x/3/root^2, 2/3, x/3
    fmul    st(0), st(2)  ; stack: root*2/3, x/3/root^2, 2/3, x/3
    faddp                 ; stack: root*2/3 + x/3/root^2, 2/3, x/3

; TODO: compare and loop back to loop_body

    fstp    [root]         ; store and pop
    fstp    st(0)          ; pop the two constants off the FP stack to empty it before returning
    fstp    st(0)
    ; finit is very slow, ~80cycles, don't use it if you don't have to.

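The 32-bit function calling conventions return FP results in st(0), so you could do that instead, but then the caller probably has to store the value somewhere.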
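As a cross-check of the data flow, here is a C model of a loop with the two constants hoisted and the compare step filled in. It is only a sketch and not a translation of tested assembly: the function name, the 0.001 tolerance taken from the question, and the 100-iteration safety cap are assumptions made for this illustration.

```c
/* C model of the hoisted Newton iteration for the cube root.
   Assumptions: starting guess 1.0, tolerance 0.001 (from the question),
   and an arbitrary cap of 100 iterations as a safety net. */
float cube_root_newton(float x) {
    const float two_thirds = 2.0f / 3.0f;  /* kept in st(1) in the asm version */
    const float x_over_3 = x / 3.0f;       /* kept in st(2) in the asm version */
    float root = 1.0f;                     /* fld1 before the loop */

    for (int i = 0; i < 100; i++) {
        float next = root * two_thirds + x_over_3 / (root * root);
        float diff = next - root;
        if (diff < 0.0f) {
            diff = -diff;                  /* what fabs would do */
        }
        root = next;                       /* new root stays "in st(0)" */
        if (diff < 0.001f) {
            break;
        }
    }
    return root;
}
```

Calling `cube_root_newton(27.0f)` walks through 9.667, 6.54, 4.57, 3.48, 3.06, 3.001 and settles near 3.0, which matches the first-pass value quoted elsewhere in this thread.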

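【Comments】:

I'd also point out that dividing by three may be better expressed as multiplying by one third.

【Solution 2】:

I'm going to answer this for x87 newcomers who may be faced with a calculation that needs to be done on the FPU.

There are two things to consider. Suppose you are given a calculation in INFIX notation, like: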

root := (2.0*root + x/(root*root)) / 3.0

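Is there a way to turn this into basic instructions the x87 FPU can use? Yes: at a very basic level the x87 FPU is a stack that acts like a sophisticated RPN calculator. The equation in your code is in INFIX notation. If you convert it to POSTFIX (RPN) notation, it can easily be implemented as operations on a stack.

This document provides some information on converting to POSTFIX notation. Following the rules, your POSTFIX equivalent looks like this: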

2.0 root * x root root * / + 3.0 /

You can key this into an old RPN calculator such as the HP 15C, using these keystrokes with root=1 and x=27:

2.0 [enter] root * x [enter] root [enter] root * / + 3.0 /

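An online HP 15C should show the result as 9.667. Translating this to basic x87:

A number is pushed onto the top of the stack (fld)
A variable is pushed onto the top of the stack (fld)
* is fmulp (multiply ST(1) by ST(0), store the result in ST(1), and pop the register stack)
/ is fdivp (divide ST(1) by ST(0), store the result in ST(1), and pop the register stack)
+ is faddp (add ST(0) to ST(1), store the result in ST(1), and pop the register stack)
- is fsubp (subtract ST(0) from ST(1), store the result in ST(1), and pop the register stack)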
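The mapping above can be tried out without an FPU. The following C sketch is a tiny postfix evaluator that mimics those rules with an 8-slot stack. It is an illustration only: `eval_rpn`, its error convention (0 on success, -1 on failure), and the fixed variable names `root` and `x` are assumptions of this sketch, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

#define X87_SLOTS 8  /* the x87 register stack has eight slots */

/* Evaluate a space-separated postfix expression. Numbers and the names
   "root" and "x" are pushed (like fld); + - * / combine st(1) with st(0)
   and pop (like faddp, fsubp, fmulp, fdivp).
   Returns 0 on success, -1 on a malformed expression or stack overflow. */
int eval_rpn(const char *expr, double root, double x, double *result) {
    double stack[X87_SLOTS];
    int depth = 0;
    char buf[256];

    if (strlen(expr) >= sizeof buf) {
        return -1;
    }
    strcpy(buf, expr);

    for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strlen(tok) == 1 && strchr("+-*/", tok[0]) != NULL) {
            if (depth < 2) {
                return -1;
            }
            double st0 = stack[--depth];     /* popped operand */
            double st1 = stack[depth - 1];   /* destination operand */
            switch (tok[0]) {
            case '+': st1 += st0; break;
            case '-': st1 -= st0; break;
            case '*': st1 *= st0; break;
            default:  st1 /= st0; break;
            }
            stack[depth - 1] = st1;
        } else {
            double value;
            if (strcmp(tok, "root") == 0) {
                value = root;
            } else if (strcmp(tok, "x") == 0) {
                value = x;
            } else {
                char *end;
                value = strtod(tok, &end);
                if (end == tok || *end != '\0') {
                    return -1;
                }
            }
            if (depth == X87_SLOTS) {
                return -1;                   /* would overflow the stack */
            }
            stack[depth++] = value;
        }
    }
    if (depth != 1) {
        return -1;
    }
    *result = stack[0];
    return 0;
}
```

Feeding it "2.0 root * x root root * / + 3.0 /" with root=1 and x=27 gives (2 + 27) / 3, the same 9.667 the calculator shows.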

You can translate 2.0 root * x root root * / + 3.0 / directly into x87 instructions:

fld Two      ; st(0)=2.0
fld root     ; st(0)=root, st(1)=2.0
fmulp        ; st(0)=(2.0 * root)
fld xx       ; st(0)=x, st(1)=(2.0 * root)
fld root     ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
fld root     ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
fmulp        ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
fdivp        ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
faddp        ; st(0)=(2.0 * root) + (x / (root * root))
fld Three    ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
fdivp        ; st(0)=((2.0 * root) + (x / (root * root))) / 3.0

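Once you have the basics down, you can move on to improving efficiency.

Regarding EDIT 2 / the follow-up question

One thing to keep in mind is that if you don't use instructions that pop values off the stack, every iteration of the loop consumes more FPU stack slots. In general, FPU instructions ending in P pop a value off the stack. You don't use any instruction that removes items from the stack, so the FPU stack keeps growing.

Unlike the program stack in user space, the FPU stack is very limited: it has only 8 slots. If you put more than 8 active values on the stack, you get an overflow error in the form of 1#IND. If we analyze your code and look at the stack after every instruction, we find: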

    fld     root            ; st(0)=root  
repreatAgain:
    fmul    two             ; st(0)=(2.0*root)      
    fld     root            ; st(0)=root, st(1)=(2.0*root) 
    fmul    st(0), st(0)    ; st(0)=(root*root), st(1)=(2.0*root)
    fld     xx              ; st(0)=x, st(1)=(root*root), st(2)=(2.0*root)
    fdiv    st(0), st(1)    ; st(0)=(x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
    fadd    st(0), st(2)    ; st(0)=((2.0*root) + x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
    fld     three           ; st(0)=3.0, st(1)=((2.0*root) + x/(root*root)), st(2)=(root*root), st(3)=(2.0*root)                                            
    fld     st(1)           ; st(0)=((2.0*root) + x/(root*root)), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
    fdiv    st(0), st(1)    ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
    jmp     repreatAgain

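Notice that after the last FDIV instruction and before the JMP we have 5 items on the stack (st(0) through st(4)). When we entered the loop we had only 1, root in st(0). The best way to fix this is to use instructions that pop (remove) values from the stack as the calculation proceeds.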
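The growth can be checked with a very small C model that tracks only the stack depth. It is a simplification made for this illustration (the function name and the '+', '-', '.' encoding are invented here): '+' stands for an instruction that pushes, such as fld, '-' for one that pops, such as fmulp or fstp, and '.' for one that leaves the depth unchanged.

```c
#define X87_SLOTS 8  /* the x87 register stack has eight slots */

/* Simulate stack depth over repeated passes of a loop body.
   ops: one character per instruction: '+' push, '-' pop, '.' neutral.
   Returns the 1-based pass in which a push first finds all eight slots
   occupied, or 0 if no overflow happens within max_iterations passes. */
int first_overflow_iteration(int depth, const char *ops, int max_iterations) {
    for (int iter = 1; iter <= max_iterations; iter++) {
        for (const char *op = ops; *op != '\0'; op++) {
            if (*op == '+') {
                if (depth == X87_SLOTS) {
                    return iter;
                }
                depth++;
            } else if (*op == '-' && depth > 0) {
                depth--;
            }
        }
    }
    return 0;
}
```

Under this model the loop analyzed above is ".+.+..++." entered with one value already on the stack: it ends the first pass at depth 5 and runs out of slots during the second pass. A pop-as-you-go sequence such as "++-+++---+--", entered with an empty stack, never exceeds depth 4.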

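Another, less efficient way is to free the values on the stack that are no longer needed before repeating the loop. The FFREE instruction can be used for this, by manually marking the unused entries starting from the bottom of the stack. If you add these lines after the code above and before jmp repreatAgain, the code should work: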

ffree   st(4)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root)
ffree   st(3)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root))
ffree   st(2)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0
ffree   st(1)           ; st(0)=(((2.0*root) + x/(root*root))/3.0)
fst     root            ; Update root variable
jmp     repreatAgain

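By using the FFREE instructions, we end the loop with only the new root in st(0).

I also added fst root because of the way your calculation is done. Your calculation includes fld root, which depends on the value in root having been updated at the end of every loop. There are more efficient ways to do this, but the fix I provided works with your current code without much rework.

If you use the inefficient/simple code snippet I provided earlier for the calculation, you end up with code like this: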

    finit        ; initialize FPU
repreatAgain:
    fld Two      ; st(0)=2.0
    fld root     ; st(0)=root, st(1)=2.0
    fmulp        ; st(0)=(2.0 * root)
    fld xx       ; st(0)=x, st(1)=(2.0 * root)
    fld root     ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
    fld root     ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
    fmulp        ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
    fdivp        ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
    faddp        ; st(0)=(2.0 * root) + (x / (root * root))
    fld Three    ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
    fdivp        ; newroot = st(0)=((2.0 * root) + (x / (root * root))) / 3.0
    fstp root    ; Store result at top of stack into root and pop value
                 ;     at this point the stack is clear again since
                 ;     all items pushed have been popped.

    jmp repreatAgain

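This code doesn't need FFREE because elements are popped off the stack as the calculation proceeds. The FADDP, FSUBP, FMULP, and FDIVP instructions also pop the value off the top of the stack. This has the side effect of keeping the stack clear of leftover intermediate results.

Integrating the loop

To integrate the loop into the simple/inefficient code I created earlier, you can use a variant of FCOM (floating-point compare) to do the comparison. The result of the floating-point comparison is then transferred/converted to the regular CPU flags (EFLAGS). After that, the regular conditional jumps can be used to perform the check. The code could look like this: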

epsilon REAL4   0.001

.CODE
main PROC
    finit              ; initialize FPU

repeatAgain:
    fld Two            ; st(0)=2.0
    fld root           ; st(0)=root, st(1)=2.0
    fmulp              ; st(0)=(2.0 * root)
    fld xx             ; st(0)=x, st(1)=(2.0 * root)
    fld root           ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
    fld root           ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
    fmulp              ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
    fdivp              ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
    faddp              ; st(0)=(2.0 * root) + (x / (root * root))
    fld Three          ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
    fdivp              ; newroot=st(0)=((2.0 * root) + (x / (root * root))) / 3.0
    fld root           ; st(0)=oldroot, st(1)=newroot
    fsub st(0), st(1)  ; st(0)=(oldroot-newroot), st(1)=newroot
    fabs               ; st(0)=(|oldroot-newroot|), st(1)=newroot
    fld epsilon        ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
    fcompp             ; Do compare&set x87 flags pop top two values off stack
                       ;     st(0)=newroot    
    fstsw ax           ; Copy x87 Status Word containing the result to AX
    fwait              ; Ensure previous instruction completed
    sahf               ; Transfer condition codes to the CPU's flags register

    fstp root          ; Store result (newroot) at top of stack into root 
                       ;     and pop value. At this point the stack is clear
                       ;     again since all items pushed have been popped.
    jbe repeatAgain    ; If 0.001 <= (|oldroot-newroot|) repeat
    mov eax, 0         ; exit
    ret
main    ENDP 
END
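How the fstsw / sahf pair turns an x87 comparison into something jbe can test may be easier to follow as a C model. This is a sketch written for this walkthrough with invented function names. It relies on the documented bit positions: C0, C2 and C3 are bits 8, 10 and 14 of the status word, and sahf copies AH into the low byte of EFLAGS, where CF is bit 0, PF is bit 2 and ZF is bit 6.

```c
#define SW_C0 0x0100u  /* status word bit 8  -> AH bit 0 -> CF */
#define SW_C2 0x0400u  /* status word bit 10 -> AH bit 2 -> PF */
#define SW_C3 0x4000u  /* status word bit 14 -> AH bit 6 -> ZF */

/* Condition-code bits an FCOM-style compare of st(0) against src sets. */
unsigned fcom_status(double st0, double src) {
    if (st0 != st0 || src != src) {          /* NaN operand: unordered */
        return SW_C3 | SW_C2 | SW_C0;
    }
    if (st0 > src) {
        return 0u;
    }
    if (st0 < src) {
        return SW_C0;
    }
    return SW_C3;                            /* equal */
}

/* Model of "fstsw ax / sahf / jbe": jbe jumps when CF or ZF is set. */
int jbe_taken(unsigned status_word) {
    unsigned ah = (status_word >> 8) & 0xFFu;  /* high byte of AX */
    int cf = (int)(ah & 1u);                   /* C0 lands in CF */
    int zf = (int)((ah >> 6) & 1u);            /* C3 lands in ZF */
    return cf || zf;
}
```

In the loop above st(0) holds epsilon and st(1) the absolute difference when fcompp runs, so `jbe_taken(fcom_status(0.001, diff))` is true exactly when 0.001 <= diff, meaning another pass is needed. In this model an unordered (NaN) compare also takes the branch.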

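Note: the use of FCOMPP and the manual transfer of the x87 flags to the CPU flags is driven by the .586 directive at the top of your code. I assumed that since you didn't specify .686 or later, instructions like FCOMI are not available. If you are using .686 or later, the bottom of the code could look like this: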

fld root           ; st(0)=oldroot, st(1)=newroot
fsub st(0), st(1)  ; st(0)=(oldroot-newroot), st(1)=newroot
fabs               ; st(0)=(|oldroot-newroot|), st(1)=newroot
fld epsilon        ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
fcomip st(0),st(1) ; Do compare & set CPU flags, pop one value off stack
                   ;     st(0)=(|oldroot-newroot|), st(1)=newroot
fstp st(0)         ; Pop temporary value off top of stack
                   ;     st(0)=newroot

fstp root          ; Store result (newroot) at top of stack into root 
                   ;     and pop value. At this point the stack is clear
                   ;     again since all items pushed have been popped.
jbe repeatAgain    ; If 0.001 <= (|oldroot-newroot|) repeat

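A quick way to create RPN/Postfix from infix notation

If learning to convert Infix notation to RPN/Postfix seems a bit daunting from the document I linked earlier in my answer, there is some relief. There are a number of websites that will do the work for you. One such site is MathBlog. Just enter your equation, click convert, and it should show you the RPN/Postfix equivalent. It is limited to +-/*, parentheses, and exponents with ^.

Optimization

A big key to optimizing the code is to optimize the formula by separating the parts that stay constant between loop iterations from the parts that vary. The constant parts can be computed before the loop begins.

Your original equation is this:

root := (2.0*root + x/(root*root)) / 3.0

Separating out the constant parts, we arrive at:

root := (2.0/3.0)*root + (x/3.0)/(root*root)

If we replace the constants with the identifiers twothirds = 2.0/3.0 and xover3 = x/3, we end up with a simplified equation like this:

root := twothirds*root + xover3/(root*root)

If we convert that to POSTFIX/RPN we get: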

twothirds root * xover3 root root * / +

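Peter takes advantage of a similar optimization in his answer, under the section on a better loop body. He places the constants Twothirds and Xover3 on the x87 FPU stack outside the loop and references them as needed inside the loop. This avoids needlessly re-reading them from memory on every pass through the loop.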
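A quick way to convince yourself that the rearranged formula computes the same step is to compare the two forms numerically. The C sketch below does that for a single step. The function names are made up for this illustration, and it works in double precision rather than REAL4.

```c
/* One Newton step in the original form: (2*root + x/root^2) / 3. */
double step_original(double root, double x) {
    return (2.0 * root + x / (root * root)) / 3.0;
}

/* The same step with the loop-invariant parts precomputed by the caller:
   twothirds = 2.0/3.0 and xover3 = x/3.0. */
double step_hoisted(double root, double twothirds, double xover3) {
    return twothirds * root + xover3 / (root * root);
}
```

For root=1 and x=27 both forms give about 9.667. They can differ in the last bits because the rounding happens at different points, which is harmless for an iteration that stops at a tolerance of 0.001.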

A more complete example based on the optimization above:

.586
.MODEL FLAT
.STACK 4096

.DATA
xx        REAL4   27.0
root      REAL4   1.0
Three     REAL4   3.0
epsilon   REAL4   0.001
Twothirds REAL4 0.6666666666666666

.CODE
main PROC
    finit               ; Initialize FPU
    fld epsilon         ; st(0)=epsilon
    fld root            ; st(0)=prevroot (Copy of root), st(1)=epsilon
    fld TwoThirds       ; st(0)=(2/3), st(1)=prevroot, st(2)=epsilon 
    fld xx              ; st(0)=x, st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
    fdiv Three          ; st(0)=(x/3), st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
    fld st(2)           ; st(0)=root, st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon

repeatAgain:

    ; twothirds root * xover3 root root * / +
    fld st(0)           ; st(0)=root, st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fmul st(0), st(3)   ; st(0)=(2/3*root), st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon           
    fxch                ; st(0)=root, st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fmul st(0), st(0)   ; st(0)=(root*root), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fdivr st(0), st(2)  ; st(0)=((x/3)/(root*root)), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    faddp               ; st(0)=((2/3*root)+(x/3)/(root*root)), st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon
    fxch st(3)          ; st(0)=prevroot, st(1)=(x/3), st(2)=(2/3), newroot=st(3)=((2/3*root)+(x/3)/(root*root)), st(4)=epsilon 
    fsub st(0), st(3)   ; st(0)=(prevroot-newroot), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    fabs                ; st(0)=(|prevroot-newroot|), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    fld st(4)           ; st(0)=0.001, st(1)=(|prevroot-newroot|), st(2)=(x/3), st(3)=(2/3), st(4)=newroot, st(5)=epsilon

    fcompp              ; Do compare&set x87 flags pop top two values off stack
                        ;     st(0)=(x/3), st(1)=(2/3), st(2)=newroot, st(3)=epsilon    
    fstsw ax            ; Copy x87 Status Word containing the result to AX
    fwait               ; Ensure previous instruction completed
    sahf                ; Transfer condition codes to the CPU's flags register

    fld st(2)           ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    jbe repeatAgain     ; If 0.001 <= (|oldroot-newroot|) repeat

    ; Remove temporary values on stack, cubed root in st(0)
    ffree st(4)         ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=epsilon
    ffree st(3)         ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3)
    ffree st(2)         ; st(0)=newroot, st(1)=(x/3)
    ffree st(1)         ; st(0)=newroot

    mov     eax, 0  ; exit
    ret
main ENDP 

END

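This code places these values on the stack before entering the loop:

st(4) = the epsilon value (0.001)
st(3) = a copy of root from before the calculation is done (effectively prevroot)
st(2) = the constant Twothirds (2/3)
st(1) = Xover3 (x/3)
st(0) = the active copy of root

Before each repetition of the loop, the stack has the layout above.

The code at the end, before exiting, removes all the temporary values and leaves the value of root at the top of the stack in st(0).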

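【Comments】:

fmulP doesn't work with memory operands. fmulp pops both stack registers and pushes the result. fmul pops st(0) and pushes the result.
@PeterCordes LOL, I had temporarily deleted my post while I was editing it. I see you must have commented before I deleted it. I tore that part out (I had typed it off the top of my head and then realized it needed some rework, for the reason you gave).
I chose to stick with the plain, simplified form until I come up with a good way to really explain it to a layman. Sometimes less is more, and I'll take the safer approach of keeping things as basic as possible.
1#IND shows up in st(0) at around the 6th loop
Wow, you really went all-out with this x87 tutorial. I didn't understand what ffree was for until I saw your example. Interestingly, a freed register keeps its place in the circular stack. After freeing a register in the middle of the stack and then running fld1 repeatedly, you can get 1,1,nan,nan,1,nan,nan (listing the results in order from st(0), i.e. the resulting stack in reverse order).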