NASM - Using Labels as Array offsets

【Question title】: NASM - Using Labels as Array offsets
【Posted】: 2015-10-05 10:56:54
【Question】:

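I'm trying to write a small program in assembler that takes three char arrays as input, computes the average of each pair of corresponding elements in the first two arrays, and stores the result in the third array, as shown below.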

%macro prologue 0
    push    rbp
    mov     rbp,rsp
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15
%endmacro
%macro epilogue 0
    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    leave
    ret
%endmacro

segment .data
    offset  db  1
segment .bss
    a1      resq    1
    a2      resq    1
    avg     resq    1
    avgL    resd    1
segment .text
    global  avgArray 
avgArray:
    prologue

    mov [a1], rdi
    mov [a2], rsi
    mov [avg], rdx
    mov [avgL], rcx

    mov rsi, [a1]
    mov r9, [a2]
    mov rdi, [avg]

    mov rcx, rsi
    add rcx, [avgL]    ; array length

    xor rdx, rdx
    xor rax, rax
    xor rbx, rbx
avgArray_loop:
    mov al, [rsi]
    mov dl, [r9]
    add ax, dx
    shr ax, 1
    mov [rdi], al

    add rsi, [offset]
    add r9, [offset]
    add rdi, [offset]

    cmp rsi, rcx
    jb  avgArray_loop
    epilogue

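When I replace [offset] with 1, it works perfectly fine. When I use [offset] to step to the next array element, however, its value does not seem to be added to rsi, rdi and r9. I have checked this with gdb: after add rsi, [offset] executes, the address stored in rsi is still the same.

Can someone tell me why using [offset] does not work while adding a plain 1 does?

By the way: this is a Linux x86_64 machine.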

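【Question comments】:

If you are going to add the offset to an 8-byte register, why declare it as a single byte?
As far as I know that doesn't matter, because it will add offset to the low byte of rdi. I did try declaring offset as a QWORD, but that didn't change anything. My assumption was also that I only need one byte for the offset: its value is 1, so I don't need to reserve more than one byte.
@muXXmit2X, did you try using the label without brackets (e.g. add rsi, offset)?
"As far as I know that doesn't matter, because it will add offset to the low byte of rdi." There is no add r64, r/m8 in the instruction set (as far as I know). There is add r/m64, imm8, but [offset] is not an immediate, it is an r/m operand.
It's also possible that offset is a reserved word, so you might as well try a different name.

Tags: arrays assembly nasm x86-64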

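【Solution 1】:

So I found a solution to the problem.

avgL and offset are stored directly after one another in memory. mov [avgL], rcx writes 8 bytes, so storing rcx into the 4-byte avgL also overwrote the value of offset. Declaring avgL as a QWORD instead of a DWORD prevents the mov from overwriting offset's data.

The new data and bss segments look like this: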

segment .data
    offset  db  1
segment .bss
    a1      resq    1
    a2      resq    1
    avg     resq    1
    avgL    resq    1
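
To make the effect concrete, here is a minimal C model of an 8-byte store landing on a 4-byte slot. It is only a sketch: the layout (a 4-byte `avgL` followed immediately by the 1-byte `offset`) is assumed for illustration, following the description above, and `store_len_qword` and `offset_after_store` are hypothetical names that do not appear in the original program.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout, for illustration only: bytes 0..3 model avgL (resd 1),
   byte 4 models offset (db 1). */
enum { AVGL_POS = 0, OFFSET_POS = 4, MEM_SIZE = 16 };

/* Models `mov [avgL], rcx`: a 64-bit store, no matter how many bytes
   were reserved at avgL. */
static void store_len_qword(unsigned char *mem, uint64_t len)
{
    memcpy(mem + AVGL_POS, &len, sizeof len);   /* writes 8 bytes */
}

/* Sets up the model memory, performs the store, and returns the value
   left in the byte that models `offset`. */
static unsigned offset_after_store(uint64_t len)
{
    unsigned char mem[MEM_SIZE] = {0};
    assert(len <= UINT32_MAX);      /* the case discussed: a small length */
    mem[OFFSET_POS] = 1;            /* offset db 1 */
    store_len_qword(mem, len);      /* upper half of len is zero */
    return mem[OFFSET_POS];
}
```

With a small length such as 16, the upper half of the 64-bit value is zero, so the byte modelling `offset` ends up as 0 and the later adds would no longer advance the pointers.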

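【Comments】:

【Solution 2】:

Nice job debugging the problem yourself. Since I had already started looking at the code, I'll give you some efficiency and style critique as added comments: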

%macro prologue 0
    push    rbp
    mov     rbp,rsp   ; you can drop this and the LEAVE.
;  Stack frames were useful before debuggers could keep track of things without them, and as a convenience
;  so local variables were always at the same offset from your base pointer, even while you were pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15
%endmacro
%macro epilogue 0
    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    leave
    ret
%endmacro

segment .data
    offset  db  1
segment .bss    ; These should really be locals on the stack (or in regs!), not globals
    a1      resq    1
    a2      resq    1
    avg     resq    1
    avgL    resd    1

segment .text
; usually a comment with a C function prototype and description is a good idea for functions
    global  avgArray
avgArray:
    prologue

    mov [a1], rdi     ; what is this silliness?  you have 16 registers for a reason.
    mov [a2], rsi     ; shuffling the values you want into the regs you want them in
    mov [avg], rdx    ; is best done with reg-reg moves.
    mov [avgL], rcx   ; I like to just put a comment at the top of a block of code
                      ; to document what goes in what reg.

    mov rsi, [a1]
    mov r9, [a2]
    mov rdi, [avg]

    mov rcx, rsi
    add rcx, [avgL]    ; This could be lea rcx, [rsi+rcx]
              ;  (since avgL is in rcx anyway as a function arg).

    xor rdx, rdx
    xor rax, rax
    xor rbx, rbx
avgArray_loop:   ; you can use a local label here, starting with a .
 ; NASM scopes a .local label to the preceding non-local label, so every function can have its own .loop
    mov al, [rsi]        ; there's a data dependency on the old value of ax
    mov dl, [r9]         ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm

    add ax, dx           ; Avoid ALU ops on 16bit regs whenever possible.  (8bit is fine, they have diff opcodes instead of a prefix)
                         ; to avoid decode stalls on Intel
    shr ax, 1            ; Better to use 32bit regs (movsx/movzx)
    mov [rdi], al

    add rsi, [offset]    ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
    add r9, [offset]
    add rdi, [offset]

    cmp rsi, rcx
    jb  avgArray_loop
    epilogue
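
The comment on `add rsi, [offset]` above (a 64-bit add reads 8 bytes, of which only the first was defined with `db`) can be modelled in C. This is a sketch under assumptions: the neighbouring byte value 0xAA is made up for illustration, and `load_qword` and `load_byte_zero_extended` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models the memory operand of `add rsi, [offset]`: with a 64-bit
   destination register, the assembler emits a 64-bit load. */
static uint64_t load_qword(const unsigned char *p)
{
    uint64_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* reads 8 bytes starting at p */
    return v;
}

/* Models a byte-sized read such as `movzx eax, byte [offset]`. */
static uint64_t load_byte_zero_extended(const unsigned char *p)
{
    assert(p != NULL);
    return (uint64_t)p[0];
}
```

If the seven bytes after the one you defined are not all zero, the 64-bit load yields something other than 1, while the zero-extended byte load always yields 1.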

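You have plenty of registers available, so why keep the loop increment in memory? I hope it only ended up that way while you were debugging and trying things out.

Also, 1-reg addressing modes are only more efficient when used as memory operands for ALU ops. When you have a lot of pointers, just increment a single counter and use base+index*scale addressing (unless you are unrolling the loop), especially if you are loading through them with mov.

Here is how I would do it (with performance analysis for Intel SnB and later):

Scalar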

; no storage needed
segment .text
GLOBAL  avgArray
avgArray:
    ; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
    ; if you can choose your prototype, do it so args go where you want them anyway.
    ; prologue
    ; rdi = avg
    ; rsi = a1
    ; rdx = a2
    ; rcx = len

    ; mov    [rsp-8], rcx    ; if I wanted to spill  len  to memory

    add    rdi, rcx  ; advance all three pointers to the end of their arrays
    add    rsi, rcx
    add    rdx, rcx
    neg    rcx       ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
    ; We could also have just counted down towards zero
    ; but HW memory prefetchers have more stream slots for forward patterns than reverse.

ALIGN 16
.loop:
    ;  use movsx for signed char
    movzx  eax, byte [rsi+rcx]     ; dependency-breaker (NASM needs the byte size for a memory source)
    movzx  r8d, byte [rdx+rcx]     ; Using r8d to save push/pop of rbx
           ; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
           ; using ebx or ebp would save a REX prefix (1 insn byte).
    add    eax, r8d
    shr    eax, 1
    mov    [rdi+rcx], al

    inc    rcx     ; No cmp needed: this is the point of counting up towards zero
    jl     .loop   ; inc/jl can Macro-fuse into one uop

    ; nothing to pop, we only used caller-saved regs.
    ret
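
For readers who want to check the indexing trick, here is a C model of the same loop shape: all three pointers are moved to the end of their arrays and a single negative index counts up towards zero. It is a sketch rather than a translation of the assembly; `avg_array_scalar` is a hypothetical name, and the `for` loop is an assumption made so that `len == 0` does nothing, whereas the assembly loop as written always runs its body at least once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C model of the scalar loop: one negative index shared by all three
   arrays, counting up towards zero. */
static void avg_array_scalar(uint8_t *avg, const uint8_t *a1,
                             const uint8_t *a2, size_t len)
{
    assert(len <= (size_t)PTRDIFF_MAX);

    /* Point each pointer one past the end of its array. */
    uint8_t *avg_end = avg + len;
    const uint8_t *a1_end = a1 + len;
    const uint8_t *a2_end = a2 + len;

    /* i starts at -len, so a1_end[i] is a1[0] on the first iteration. */
    for (ptrdiff_t i = -(ptrdiff_t)len; i < 0; i++) {
        unsigned sum = (unsigned)a1_end[i] + a2_end[i];  /* movzx, movzx, add */
        avg_end[i] = (uint8_t)(sum >> 1);                /* shr, store low byte */
    }
}
```

Because the loop condition is simply "index is still negative", the increment itself produces the flags the branch needs, which is why the assembly version gets away without a separate cmp.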

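On Intel, the loop is 7 uops (the store is 2 uops, store-address and store-data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will take 2 cycles per byte. movzx (to a 32- or 64-bit register) is 1 uop either way, because there is no port 0/1/5 uop for the load to micro-fuse with. (It is a pure load, not a read-modify.)

7 uops take 2 issue groups of up to 4 uops each, so the loop can issue in 2 cycles. No other bottleneck should keep the execution units from keeping up, so it should run at one iteration per 2 cycles.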

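Vector

There is a vector instruction that does almost exactly this: PAVGB is the packed average of unsigned bytes, with a 9-bit temporary to avoid overflow like your add/shr. One difference: PAVGB computes (a + b + 1) >> 1, so it rounds up where add/shr truncates, and the results differ by one whenever a + b is odd.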
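
A quick way to see how PAVGB relates to the add/shr sequence is to model both per byte in C. PAVGB is documented as (a + b + 1) >> 1, while add/shr truncates. The function names below are hypothetical and the exhaustive comparison is only an illustrative sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating average, as computed by add + shr in the scalar loop. */
static uint8_t avg_truncating(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* Per-byte behaviour of PAVGB: (a + b + 1) >> 1, i.e. round half up. */
static uint8_t avg_pavgb_model(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Counts the input pairs on which the two definitions disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            unsigned t = avg_truncating((uint8_t)a, (uint8_t)b);
            unsigned p = avg_pavgb_model((uint8_t)a, (uint8_t)b);
            assert(p == t || p == t + 1);   /* never off by more than one */
            if (p != t)
                mismatches++;
        }
    }
    return mismatches;
}
```

The two agree whenever a + b is even and differ by exactly one otherwise, which matters if the vector version has to reproduce the scalar output bit for bit.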

; no storage needed
segment .text
GLOBAL  avgArray
avgArray:
    ; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
    ; rdi = avg
    ; rsi = a1
    ; rdx = a2
    ; rcx = len

; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
    ;  PAVGB treats the bytes as unsigned
    movdqu    xmm0, [rsi+rcx]    ; 1 uop
    pavgb     xmm0, [rdx+rcx]    ; 2 uops (no micro-fusion)
    movdqu    [rdi+rcx], xmm0    ; 2 uops: no micro-fusion

    add    rcx, 16
    jl     .loop          ; 1 macro-fused uop add/branch
    ; TODO: scalar cleanup.
    ret

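Getting the loop-exit condition right is tricky, because the vector loop has to stop if the next 16 bytes would run past the end of the array. It is probably best to handle that by decrementing rcx by 15 or so before adding it to the pointers.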
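
The exit condition and the scalar cleanup can be sketched in C with SSE2 intrinsics. This sketch swaps in a simpler scheme than the one suggested above: it checks the remaining length before each 16-byte step instead of biasing the counter. `avg_array_sse2` is a hypothetical name, unaligned loads are used for both inputs, and the cleanup loop rounds up so that its results match what PAVGB produces. On targets without SSE2 the preprocessor guard leaves only the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

static void avg_array_sse2(uint8_t *avg, const uint8_t *a1,
                           const uint8_t *a2, size_t len)
{
    size_t i = 0;
    assert(len == 0 || (avg != NULL && a1 != NULL && a2 != NULL));

#if defined(__SSE2__)
    /* Vector loop: runs only while a full 16-byte chunk remains. */
    for (; len - i >= 16; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a1 + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(a2 + i));
        _mm_storeu_si128((__m128i *)(avg + i), _mm_avg_epu8(x, y)); /* pavgb */
    }
#endif

    /* Scalar cleanup for the leftover bytes, rounding like PAVGB. */
    for (; i < len; i++)
        avg[i] = (uint8_t)(((unsigned)a1[i] + a2[i] + 1u) >> 1);
}
```

Keeping the cleanup loop's rounding identical to the vector path means the output does not depend on where the 16-byte boundary happens to fall.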

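Again the loop issues in 2 cycles per iteration (6 uops this time), but each iteration handles 16 bytes. Ideally you unroll so the loop body is a multiple of 4 uops; otherwise you lose issue bandwidth to a final issue group of fewer than 4 uops at the end of each iteration. Once unrolled, 2 loads / 1 store per cycle is the bottleneck, since PAVGB has a throughput of 2 per cycle.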
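
The issue-rate arithmetic used in this analysis is just a ceiling division, sketched below in C. The uop counts (7 for the scalar loop, 6 for the vector loop) and the issue width of 4 are taken from the text above; the function names are hypothetical, and the model deliberately ignores every bottleneck other than issue width.

```c
#include <assert.h>

/* Cycles needed to issue one loop iteration when the front end issues
   at most issue_width uops per cycle. */
static unsigned issue_cycles(unsigned uops, unsigned issue_width)
{
    assert(issue_width > 0);
    return (uops + issue_width - 1) / issue_width;   /* ceiling division */
}

/* Bytes processed per cycle if issue width were the only bottleneck. */
static double bytes_per_cycle(unsigned bytes_per_iter, unsigned uops,
                              unsigned issue_width)
{
    return (double)bytes_per_iter / issue_cycles(uops, issue_width);
}
```

Under this model the scalar loop manages 0.5 bytes per cycle and the non-unrolled vector loop 8 bytes per cycle, a factor of 16, and a body of exactly 8 uops would still issue in 2 cycles, which is the argument for unrolling to a multiple of 4.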

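16B per cycle shouldn't be difficult on Haswell and later. With AVX2 and ymm registers you would get 32B per cycle. (SnB/IvB can only do two memory operations per cycle, at most one of them a store, unless you use 256-bit loads/stores.) Anyway, at this point you have already gained a massive 16x speedup from vectorizing, and usually that is good enough. I just enjoy tuning for theoretical maximum throughput by counting uops and unrolling. :)

If you are going to unroll the loop at all, it is worth incrementing the pointers instead of just an index. (So there would be two uses of [rdx] and one add, versus two uses of [rdx+rcx].)

Either way, cleaning up the loop setup and keeping everything in registers saves a decent number of instruction bytes, and reduces the overhead for short arrays.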

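【Comments】:

The trick of using vectors is awesome ;) The algorithm I wanted to optimize is now twice as efficient. But I have two more questions. To use vectorization, the arrays should be 16-byte aligned, right? And if I know the arrays are in fact 32-byte aligned, couldn't I unroll the loop to do two vector operations?
Unrolling doesn't increase the alignment requirement. It increases the complexity of the loop-end condition and the cleanup code (unless you simply handle all the leftovers with scalar code). So what matters is whether the length (minus any misalignment) is a multiple of the bytes per iteration, because then the cleanup loop never has to run. 32B alignment matters for AVX2, to avoid crossing a cache line on every other access.
Also note that my SSE loop only requires rdx to be aligned (or a scalar intro that runs until it is aligned), because the other arrays are accessed with movdqu. If you are writing a function that has to work in the general case, you must handle each array being misaligned by a different amount. Auto-vectorizing compilers like gcc and clang sometimes emit different versions of a loop for the different alignment cases.
I tried unrolling the loop but it wasn't any faster, so I changed the array alignment back to 16 bytes and call pavgb only once per iteration. Thanks for the explanations though ;)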
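
The "scalar intro that runs until it is aligned" mentioned in the comments needs to know how many bytes to process first. One common way to compute that is sketched below in C; `bytes_until_aligned` is a hypothetical helper, and it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes a scalar intro loop must process before p reaches
   the next multiple of align (align must be a power of two). */
static size_t bytes_until_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Negating the address modulo align gives the distance to the
       next boundary; an already aligned pointer yields 0. */
    return (size_t)(-(uintptr_t)p & (uintptr_t)(align - 1));
}
```

A caller would run the scalar loop for `min(len, bytes_until_aligned(a2, 16))` bytes and then enter the vector loop, with the other two arrays still accessed through unaligned loads and stores.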