Good job debugging your problem yourself. Since I'd already started looking at the code, here is some efficiency/style critique, added as comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience
; so local variables were always at the same offset from your base pointer, even while you were pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this silliness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a different name for each loop: NASM scopes a local label to the most recent non-local label, so every function can have its own .loop
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible. (8bit is fine, they have diff opcodes instead of a prefix)
; to avoid decode stalls on Intel
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
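You have tons of registers available, so why keep the loop increment in memory? I hope it just ended up that way while you were debugging / trying things.
Also, one-register addressing modes are only more efficient when used as memory operands for ALU ops. When you have a lot of pointers, just increment a single counter and use base+index*scale addressing (unless you're unrolling the loop), especially if you load through them with mov.
Here's how I'd do it (with performance analysis for Intel SnB and later):
Scalar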
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rdi, rcx ; point each pointer at the end of its array
add rsi, rcx
add rdx, rcx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx (and sar instead of shr) for signed char
movzx eax, byte [rsi+rcx] ; dependency-breaker
movzx r8d, byte [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
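For anyone who would rather see the indexing trick outside of asm, here is a rough C rendering of the same idea: move every pointer to the end of its array, then run a single negative index up towards zero. The function name `avg_array_scalar` is made up for this sketch, and it assumes `len` fits in a `ptrdiff_t`. A compiler is free to pick different addressing modes, so treat it as a model of the loop structure, not of the exact instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the scalar loop: truncating average of two byte arrays.
 * All three pointers are advanced to one-past-the-end, and a single signed
 * index counts from -len up to 0, so the loop test needs no separate compare
 * against an end pointer. */
void avg_array_scalar(uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
{
    avg += len;
    a1 += len;
    a2 += len;
    /* Assumption: len <= PTRDIFF_MAX, so the negation below is well defined. */
    for (ptrdiff_t i = -(ptrdiff_t)len; i < 0; i++) {
        unsigned sum = (unsigned)a1[i] + a2[i]; /* widen first: 255+255 needs 9 bits */
        avg[i] = (uint8_t)(sum >> 1);           /* same truncation as add/shr */
    }
}
```

One difference worth knowing: this `for` loop tests before the first iteration, so it does nothing for `len == 0`. The asm loop is a do-while and always runs at least once, so it would need an up-front check if a zero length is possible.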
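On Intel, the loop is 7 uops (the store is 2 uops, store-address and store-data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will run it at 2 cycles per byte. movzx (to a 32- or 64-bit register) is 1 uop regardless, because there is no port 0/1/5 uop for the load to micro-fuse with or not. (It's a pure load, not a read-modify.)
7 uops take two issue groups of up to 4 uops each, so the loop can issue in 2 cycles. No other bottleneck should stop the execution units from keeping up, so it should run at one iteration per 2 cycles.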
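The arithmetic behind that estimate is just a ceiling division. As a minimal sketch (the helper name `issue_cycles` is hypothetical, and the model deliberately ignores everything except front-end issue width):

```c
/* Hypothetical helper: cycles needed to issue one loop iteration, assuming
 * the front end issues up to `width` fused-domain uops per cycle and that
 * issue groups do not span iterations. Execution ports, cache misses, and
 * decode limits are NOT modelled. */
unsigned issue_cycles(unsigned uops, unsigned width)
{
    return (uops + width - 1) / width; /* ceil(uops / width) */
}
```

With `width = 4`, a 7-uop body costs 2 cycles, and so does an 8-uop body. That is why a loop whose uop count is a multiple of 4 wastes no issue slots.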
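Vector
There's a vector instruction that does exactly this operation: PAVGB is a packed average of unsigned bytes. It uses a 9-bit temporary to avoid overflow, like your add/shr, except that it rounds up: each lane is (a + b + 1) >> 1.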
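To make the per-lane behaviour concrete, here is a small C model of one byte lane of PAVGB next to the add/shr version. The function names `pavgb_lane` and `add_shr_lane` are invented for this illustration. The two agree whenever a + b is even and differ by one when it is odd.

```c
#include <stdint.h>

/* One byte lane of PAVGB: (a + b + 1) >> 1, computed in a wider temporary
 * so the carry out of bit 7 is not lost. Rounds halves up. */
uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* The scalar add/shr sequence: (a + b) >> 1. Truncates halves down. */
uint8_t add_shr_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}
```

If bit-exact agreement with an existing truncating implementation matters, that off-by-one on odd sums is the thing to check before switching to PAVGB.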
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup as the scalar version: point each pointer at the end of its array, then neg rcx
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
; pavgb is unsigned-only: there is no signed packed-average instruction
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
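Getting the loop-exit condition right is tricky, because you need to end the vector loop if the next 16B would go past the end of the array. It's probably best to handle that by decrementing rcx by 15 or so before adding it to the pointers.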
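Here is one way the exit condition and cleanup could look, sketched in C with SSE2 intrinsics instead of hand-written asm. The function name `avg_array_sse2` is hypothetical. It uses unaligned loads and stores throughout rather than an alignment prologue, and it falls back to the scalar loop alone when the compiler does not define `__SSE2__`. The cleanup loop rounds up like PAVGB so every byte is computed the same way. That differs from the truncating add/shr loop on odd sums.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_avg_epu8, _mm_storeu_si128 */
#endif

/* Hypothetical sketch: rounding average of two byte arrays, 16 bytes at a time. */
void avg_array_sse2(uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Vector loop: only runs while a full 16-byte chunk remains, so it never
     * reads or writes past the end of the arrays. */
    for (; i + 16 <= len; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a1 + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(a2 + i));
        _mm_storeu_si128((__m128i *)(avg + i), _mm_avg_epu8(x, y)); /* PAVGB */
    }
#endif
    /* Scalar cleanup for the remaining len % 16 bytes (or for everything,
     * if SSE2 is unavailable), with the same round-up as PAVGB. */
    for (; i < len; i++)
        avg[i] = (uint8_t)(((unsigned)a1[i] + a2[i] + 1u) >> 1);
}
```

The `i + 16 <= len` test plays the same role as biasing rcx by 15 in the asm: the vector loop stops as soon as fewer than 16 bytes are left, and the scalar loop handles the tail.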
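So again it's 6 uops / 2 cycles per iteration, but each iteration does 16 bytes. Ideally you'd unroll so the loop is a multiple of 4 uops, so you don't lose issue throughput to a final group of fewer than 4 uops at the end of the loop. 2 loads / 1 store per cycle is then the bottleneck, since PAVGB has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 and ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores.) Anyway, at this point you've already got a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning for theoretical max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx].)
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.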