Multiply two 16-bit numbers and store the 32-bit result in dx:ax without using the mul instruction in 8086 assembly

【Question title】: Multiply two 16 bit numbers and store 32 bit answer in dx:ax without mul instruction in assembly 8086
【Posted】: 2016-01-24 01:26:40
【Question】:

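I am trying to multiply two 16-bit numbers using the shift-and-add method in assembly language, storing the high part in the dx register and the low part in the ax register. The multiplicand and multiplier are passed on the stack. For some of my tests I get the right answer, but for others the part holding the high half, dx, is wrong. For example, if I do 0002 times 0001 I get dx = 0002 ax = 0002, when the answer should be dx = 0000 ax = 0002.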

Here is my code. I can't figure out where it goes wrong. I even worked the example by hand, but I can't see how the dx = 0002 part gets there.

;---------------------------------------
; Multiply data
;---------------------------------------

h         dw        0                   ; this holds the high order bits

mltplier  dw        0                   ; this holds the multiplier

     .code
;---------------------------------------
; Multiply code
;---------------------------------------
_multiply:                             ;
     push      bp                  ; save bp
     mov       bp,sp               ; anchor bp into the stack
     mov       bx,[bp+4]           ; load multiplicand from the stack
     mov       cx,[bp+6]           ; load multiplier   from the stack
     mov       [mltplier],cx       ;
     mov       cx,0Fh              ; make counter of 16
     mov       ax,0                ;
     mov       dx,0                ;

;  calculate multiplicand * multiplier
;  return result in dx:ax
_loop:
     shr       [mltplier],1        ; shift right by 1
     jnc       shift               ; if the number shifted out was not a 1           
                                   ;then we don't need to add anything
     clc                           ;clear carry flag
     add       ax,bx               ; add bx to ax, the low bits
     add       dx,[h]              ; add var to dx, the high bits
shift:                                 ;
     shl       [h],1               ; shift the high order bits left
     shl       bx,1                ; shift the low order bits left
     adc       [h],0               ; add to the high bits
     clc                           ;clear carry flag
    loop       _loop               ; loop the process
     pop       bp                  ; restore bp
     ret                           ; return with result in dx:ax
                                   ;
     end                           ; end source code
;---------------------------------------

【Comments】:

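Do you have an implementation running on a computer, or is this only on paper? If you have a running implementation, single-stepping through your program with a debugger will help a lot.
mov cx,0Fh makes the loop counter 15, not 16.
Even with 16, I get dx = 0004. And yes, I did run this code on a computer; that is how I found that my test cases give different answers.
Doesn't every MCU book contain an example of this kind of thing?
You are shifting the multiplier right and testing its l.s. bit. You need to shift it left and test its m.s. bit, because you are shifting the product left. You also need to shift the product before adding.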

Tags: assembly 32-bit multiplication x86-16 16-bit

【Solution 1】:

This shows how to multiply two 16-bit values to get a 32-bit value (held in two 16-bit registers).

#include <stdio.h>

unsigned multiply16x16(unsigned short m, unsigned short n) {
    __asm {
        xor     ax,ax       ; clear the product
        xor     dx,dx
        mov     cx,16       ; set up loop counter
    nextbit:
        shl     ax,1        ; shift 32-bit product left
        adc     dx,dx
        shl     [m],1       ; get m.s. bit of multiplier
        jnc     noadd       ; ignore if not set
        add     ax,[n]      ; add multiplicand to product
        adc     dx,0        ; with carry
    noadd:
        loop    nextbit     ; loop counter stops when cx == 0
        mov     [m],ax      ; store in 16-bit operands
        mov     [n],dx
    }
    return ((unsigned)n << 16) + m;   // align and return as 32-bit unsigned
}

int main(void){
    unsigned short m, n;

    m=3; n=5;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=2;
    printf("%u\n", multiply16x16 (m,n));

    m=987; n=654;
    printf("%u\n", multiply16x16 (m,n));

    m=123; n=456;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=65535;
    printf("%u\n", multiply16x16 (m,n));

    return 0;
}

Program output:

15
131070
645498
56088
4294836225
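
For checking results on a compiler that has no 16-bit inline assembly, the same MSB-first loop can be modeled in portable C. This is only a sketch and not part of the original answer: the function name mul16_msb_first is hypothetical, and a single uint32_t stands in for the dx:ax register pair.

```c
#include <stdint.h>

/* Hypothetical helper: portable model of the MSB-first shift-and-add loop.
 * 'product' plays the role of dx:ax, 'm' is the multiplier, 'n' the multiplicand. */
uint32_t mul16_msb_first(uint16_t m, uint16_t n) {
    uint32_t product = 0;
    for (int i = 0; i < 16; i++) {
        product <<= 1;              /* shl ax,1 / adc dx,dx            */
        if (m & 0x8000u)            /* the bit that shl [m],1 puts in CF */
            product += n;           /* add ax,[n] / adc dx,0           */
        m = (uint16_t)(m << 1);     /* discard the bit just tested     */
    }
    return product;
}
```

Calling mul16_msb_first with the same operand pairs as main above should give the same five products, which makes it a convenient reference when single-stepping the assembly version.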

【Discussion】:

【Solution 2】:

WeatherVane's comment probably has the fix for the wrong answers.

Some notes on efficiency:

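Zero a register by XORing it with itself. It takes fewer instruction bytes than mov r, 0 and is better in every way. (Prefer xor over sub same,same or other options, because more CPUs recognize xor same,same as independent of the old value.)
You don't need the clc after jnc. It runs only when the jnc falls through, that is, with carry set, but the add that follows ignores the incoming CF and overwrites it. The clc before the loop instruction is also useless, since loop leaves the flags alone and the shr at the top of the next iteration rewrites CF before any adc reads it.
Keeping variables in memory is slow. Keep mltplier in si or di instead of using shr [mltplier],1. (If it lets you use a register instead of a memory location inside the loop, a push/pop to save and restore that register once per function call is worth it.) Likewise, keep [h] in a register.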

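If you do need to spill to memory, generally prefer the stack over global variables, so that your function is reentrant and thread-safe. For mltplier in particular, you can just use the value the caller put on the stack rather than copying it.
loop is slow on modern x86 CPUs compared to dec cx / jne; on Haswell, for example, it has roughly 7 times the loop overhead. You can free up a register and speed up the loop by iterating while the multiplier != 0, rather than always running 16 iterations. Then you can keep the multiplier in cx and loop with test cx, cx / jne.
With h in a register (e.g. di):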

 shl       [h],1               ; shift the high order bits left
 shl       bx,1                ; shift the low order bits left
 adc       [h],0               ; add to the high bits

could be:

 shl       bx, 1               ; shift the low order bits left
 adc       di, di              ; shift the high order bits left and add the carry

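If you are targeting a 386 or later CPU, the shld double-register shift also works, with the advantage that the two instructions can run in parallel rather than one depending on the other: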

 shld      di, bx, 1
 shl       bx, 1

shld r,r,i is cheaper than adc on Intel Sandybridge-family CPUs (1 uop vs. 2).

See Agner Fog's instruction tables and guides, and the other links in the x86 tag wiki.
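
To see how these notes fit together, here is a rough C model of an LSB-first loop that keeps everything in "registers" and stops as soon as the multiplier is zero. It is a sketch under stated assumptions, not the asker's or this answer's actual code: the function name mul16_lsb_first is hypothetical, each uint16_t local stands in for the 8086 register named in its comment, and the high half of the product is updated with an add-with-carry so that a carry out of the low add is not lost.

```c
#include <stdint.h>

/* Hypothetical model: every uint16_t local mimics one 16-bit register. */
uint32_t mul16_lsb_first(uint16_t multiplicand, uint16_t multiplier) {
    uint16_t ax = 0, dx = 0;        /* 32-bit product in dx:ax               */
    uint16_t bx = multiplicand;     /* low half of the shifted multiplicand  */
    uint16_t di = 0;                /* high half (the question's [h])        */

    while (multiplier != 0) {       /* early exit instead of a fixed count   */
        if (multiplier & 1u) {
            uint16_t old_ax = ax;
            ax = (uint16_t)(ax + bx);              /* add ax,bx              */
            uint16_t carry = (ax < old_ax);        /* CF produced by that add */
            dx = (uint16_t)(dx + di + carry);      /* adc dx,di              */
        }
        multiplier >>= 1;                          /* shr cx,1               */
        uint16_t top = (uint16_t)(bx >> 15);       /* bit shifted out of bx  */
        bx = (uint16_t)(bx << 1);                  /* shl bx,1               */
        di = (uint16_t)((di << 1) | top);          /* adc di,di              */
    }
    return ((uint32_t)dx << 16) | ax;
}
```

Because di starts at zero on every call, the result does not depend on anything left over from an earlier call, which is one practical benefit of keeping the high half out of a global variable.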

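【Discussion】:

【Solution 3】:

A more efficient, and more fun, approach is to subvert the exercise by synthesizing the forbidden unsigned multiply (mul) from the signed multiply (imul).

Flipping the MSB of an unsigned integer, which is equivalent to subtracting 8000h modulo 2^16, maps the value into the signed-integer range without underflow. That allows computing (a-8000h)*(b-8000h); adding back a*8000h + b*8000h - 40000000h then yields a*b.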

multiply:
    push bp
    mov bp,sp
    mov ax,[bp+4]
    xor ax,8000h
    mov dx,[bp+6]
    xor dx,8000h
    imul dx
    sub dx,4000h
    mov cx,[bp+4]
    add cx,[bp+6]
    rcr cx,1    ;Halve the 17-bit sum a+b (CF:cx); the bit shifted
    jnc @f      ;out is worth 8000h in the low word
    add ax,8000h
@@: adc dx,cx
    pop bp
    ret
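
The identity behind this routine can be checked in C. The following is only an illustrative sketch: the function name mul16_via_signed is hypothetical, and 32-bit C arithmetic stands in for the imul and the dx:ax fix-up steps.

```c
#include <stdint.h>

/* Hypothetical check of a*b = (a-8000h)*(b-8000h) + (a+b)*8000h - 40000000h. */
uint32_t mul16_via_signed(uint16_t a, uint16_t b) {
    /* Value of (a xor 8000h) when read as a signed 16-bit number. */
    int32_t sa = (int32_t)a - 0x8000;
    int32_t sb = (int32_t)b - 0x8000;

    /* Signed 16x16 product, as imul would leave it in dx:ax.
     * Its magnitude is at most 2^30, so it fits in int32_t. */
    int32_t signed_product = sa * sb;

    /* Add back the cross terms and remove 8000h * 8000h.
     * Unsigned arithmetic wraps modulo 2^32, like the register pair. */
    uint32_t result = (uint32_t)signed_product;
    result += ((uint32_t)a + b) * 0x8000u;
    result -= 0x40000000u;
    return result;
}
```

Intermediate values may wrap, but the final value is exact because a*b always fits in 32 bits.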

(For the record, this is really more of a comment, posted as an answer because of its length.)

【Discussion】: