Latency of short loop: answers

【Question title】: Latency of short loop
【Posted】: 2016-08-18 20:11:36
【Question】:

I'm trying to understand why some simple loops run at the speed they do.

First case:

L1:
    add rax, rcx  # (1)
    add rcx, 1    # (2)
    cmp rcx, 4096 # (3)
    jl L1

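According to IACA, the throughput is 1 cycle and the bottleneck is ports 0, 1 and 5. I don't understand why it is 1 cycle. After all, we have two loop-carried dependencies:

(1) -> (1) (latency is 1)
(2) -> (2), (2) -> (1), (2) -> (3) (latency is 1 + 1 + 1).

And that latency is loop-carried, so it should slow down our iterations.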

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  |
-------------------------------------------------------------------------


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 1.0       |     |           |           |     |     | CP | add rax, rcx
|   1    |           | 1.0 |           |           |     |     | CP | add rcx, 0x1
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xfffffffffffffff2
Total Num Of Uops: 3

Second case:

L1:    
    add rax, rcx
    add rcx, 1
    add rbx, rcx
    cmp rcx, 4096
    jl L1

Block Throughput: 1.65 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.4    0.0  | 1.4  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.3  |


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.6       | 0.3 |           |           |     |     |    | add rax, rcx
|   1    | 0.3       | 0.6 |           |           |     |     | CP | add rcx, 0x1
|   1    | 0.3       | 0.3 |           |           |     | 0.3 | CP | add rbx, rcx
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xffffffffffffffef

I understand even less why the throughput here is 1.65.

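【Question comments】:

Did you try actually running it and measuring instructions per cycle? (Change the 4096 to something huge.) I haven't finished analyzing it yet, but your 1+1+1 latency is clearly wrong: cmp is only part of a dependency chain when a later insn reads the flags.
IACA is not cycle-accurate. It's always an approximation, except in simple cases where the bottleneck is simple. Its 1.65 is very close to the observed number. Oh, I just noticed the 1.65 comes from a different loop, with an extra add.
@PeterCordes I edited.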

Tags: performance assembly optimization x86 micro-optimization

【Answer 1】:

In the first loop there are two dep chains, one for rax and one for rcx.

add rax, rcx  # depends on rax and rcx from the previous iteration, produces rax for the next iteration

add rcx, 1    # latency = 1

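The 2-cycle-latency dep chain from add rcx,1 -> add rax, rcx spans 2 iterations (so it has already had time to happen), and it isn't even loop-carried (because rax doesn't feed back into add rcx,1).

In any given iteration, only the previous iteration's results are needed to produce this iteration's results. There are no loop-carried dependencies within an iteration, only between iterations.

As I explained in my answer to your question a couple of days ago, cmp/jcc is not part of the loop-carried dep chain.

cmp is only part of a dep chain if a cmov or setcc reads the flag output it generates. Control dependencies are predicted, not waited for like data dependencies.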
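The reasoning above can be checked with a small latency-only model. This is a rough sketch in Python, not a simulation of any real core: it assumes unlimited execution ports, 1-cycle latency for every uop, and a perfectly predicted branch. The function name ready_times is made up for illustration.

```python
# Idealized latency-only model (assumption): unlimited execution ports,
# every uop has 1-cycle latency, and the loop branch is perfectly predicted.
LATENCY = 1

def ready_times(iterations):
    """Return the cycle at which each iteration's rax result becomes ready."""
    rax_ready = 0  # cycle when the current value of rax is available
    rcx_ready = 0  # cycle when the current value of rcx is available
    history = []
    for _ in range(iterations):
        # add rax, rcx: reads rax and rcx produced by the previous iteration
        new_rax = max(rax_ready, rcx_ready) + LATENCY
        # add rcx, 1: reads only rcx from the previous iteration
        new_rcx = rcx_ready + LATENCY
        # cmp rcx, 4096 / jl: reads the new rcx, but nothing consumes its
        # flags as data, so it never delays the next iteration in this model
        rax_ready, rcx_ready = new_rax, new_rcx
        history.append(rax_ready)
    return history

times = ready_times(1000)
cycles_per_iteration = (times[-1] - times[0]) / (len(times) - 1)
print(cycles_per_iteration)
```

Both chains advance by one cycle per iteration in this model, so latency alone allows one iteration per cycle. The model says nothing about front-end or port limits, which have to be considered separately.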

In practice, on my E6600 (first-gen Core2; I don't have a SnB available at the moment):

; Linux initializes most registers to zero on process startup, and I'm lazy so I depended on this for this one-off test.  In real code, I'd xor-zero ecx
    global _start
_start:
L1:
    add eax, ecx        ; (1)
    add ecx, 1          ; (2)
    cmp ecx, 0x80000000 ; (3)
    jb L1            ; can fuse with cmp on Core2 (in 32bit mode)

    mov eax, 1
    int 0x80

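I ported it to 32-bit because Core2 can only macro-fuse in 32-bit mode, and I used jb because Core2 can only macro-fuse unsigned branch conditions. I used a large loop counter so I didn't need another outer loop. (IDK why you picked a tiny loop count like 4096. Are you sure you weren't measuring extra overhead from something else outside your short loop?)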

$ yasm -Worphan-labels -gdwarf2 -felf tinyloop.asm && ld -m elf_i386 -o tinyloop tinyloop.o
$ perf stat -e task-clock,cycles,instructions,branches ./tinyloop

Performance counter stats for './tinyloop':

    897.994122      task-clock (msec)         #    0.993 CPUs utilized          
 2,152,571,449      cycles                    #    2.397 GHz                    
 8,591,925,034      instructions              #    3.99  insns per cycle        
 2,147,844,593      branches                  # 2391.825 M/sec                  

   0.904020721 seconds time elapsed

So it runs at 3.99 insns per cycle, which means about one iteration per cycle.
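For reference, here is how those figures fall out of the perf counters above. It is just the arithmetic, written as a short Python snippet; using the branch count as the iteration count is an approximation that assumes one jb per iteration and negligible branches outside the loop.

```python
# Counter values copied from the perf stat output above.
cycles = 2_152_571_449
instructions = 8_591_925_034
branches = 2_147_844_593  # assumption: roughly one jb per loop iteration

ipc = instructions / cycles              # instructions per cycle
cycles_per_iteration = cycles / branches  # approximate cycles per loop iteration

print(f"IPC = {ipc:.2f}")
print(f"cycles per iteration = {cycles_per_iteration:.3f}")
```

perf does not report cycles per iteration directly, so dividing by a known iteration count (or a counter that tracks it, like branches here) is the usual way to get it.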

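I'd be surprised if your Ivybridge ran that exact code only about half as fast. Update: per discussion in chat, yes, it appears IVB really does only get 2.14 IPC (one iteration per 1.87c). Changing add rax, rcx to add rax, rbx or something similar, to remove the dependency on the loop counter from the previous iteration, brings the throughput up to 3.8 IPC (one iteration per 1.05c). I don't understand why this happens.

With a similar loop that doesn't depend on macro-fusion (add / inc ecx / jnz), I also get one iteration per 1c (2.99 insns per cycle).

However, having a 4th insn in the loop that also reads ecx makes it massively slower. Core2 can issue 4 uops per clock, even though (like SnB/IvB) it only has three ALU ports. (A lot of code includes memory uops, so this does make sense.)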

add eax, ecx       ; changing this to add eax,ebx  helps when there are 4 non-fusing insns in the loop
; add edx, ecx     ; slows us down to 1.34 IPC, or one iter per 3c
; add edx, ebx     ; only slows us to 2.28 IPC, or one iter per 1.75c
                   ; with neither:    3    IPC, or one iter per 1c
inc ecx
jnz L1             ; loops 2^32 times, doesn't macro-fuse on Core2

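I expected it to still run at 3 IPC, or one iter per 4/3 = 1.333c. However, pre-SnB CPUs have many more bottlenecks, such as ROB-read and register-read bottlenecks. SnB's switch to a physical register file eliminated those slowdowns.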
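The 4/3 figure is just a port-count bound. A minimal sketch of that estimate in Python, assuming every instruction in the loop is a single unfused ALU uop, that the three ALU ports can share the work evenly, and ignoring every other bottleneck (the helper name throughput_bound is made up for illustration):

```python
from fractions import Fraction

def throughput_bound(alu_uops, alu_ports=3, issue_width=4):
    """Lower bound on cycles per iteration from issue width and ALU ports only.

    Assumption: each instruction is one unfused ALU uop and the ALU ports
    are perfectly balanced. Real cores have further limits not modeled here.
    """
    port_bound = Fraction(alu_uops, alu_ports)
    issue_bound = Fraction(alu_uops, issue_width)
    return max(port_bound, issue_bound)

three_insn_loop = throughput_bound(3)  # add / inc / jnz
four_insn_loop = throughput_bound(4)   # add / add / inc / jnz

print(float(three_insn_loop), float(four_insn_loop))
print(float(4 / four_insn_loop))  # implied IPC for the 4-insn loop
```

With 4 uops over 3 ALU ports the bound is 4/3 cycles per iteration, which is 3 IPC. The measured numbers in the comments above are worse than this bound, which is why other bottlenecks have to be involved.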

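In your 2nd loop, IDK why it doesn't run at one iteration per 1.333c. The insn that updates rbx can only run after the other instructions from that iteration, but that's what out-of-order execution is for. Are you sure it's as slow as one iteration per 1.85 cycles? Did you measure with perf, with a high enough count to get meaningful data? (rdtsc cycle counts aren't accurate unless you disable turbo and frequency scaling, but perf counters still count actual core cycles.)

I don't think it would be much different from: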

L1:    
    add rax, rcx
    add rbx, rcx      # before/after inc rcx shouldn't matter because of out-of-order execution
    add rcx, 1
    cmp rcx, 4096
    jl L1

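【Comments】:

Thanks :). In a comment on your answer here: stackoverflow.com/questions/36739118/dependency-chain-analysis/… you said there is a loop-carried dependency (I mean: "add is a loop-carried dependency chain"), and now you say there isn't.
@Gilgamesz: My wording was confusing, sorry. I meant that the dependency of cmp on add exists, but it isn't part of a loop-carried dependency chain. Of course there are loop-carried dependency chains, but the dependencies within each iteration aren't part of them.
For my first loop I get 1.9 cycles per iteration (perf and Agner Fog's tools), but I don't understand why it's 1.9. We have a two-cycle loop-carried latency: add rax, rcx # depends on rax and rcx from the previous iteration, produces rax for the next iteration
Are you sure your number is cycles per iteration, not instructions per cycle? perf can't directly measure cycles per iteration. But anyway, no, there's no 2-cycle-latency dependency chain between one iteration and the next. The 2-cycle-latency dep chain from inc rcx -> add rax, rcx spans 2 iterations (so it has already had time to happen), and it isn't loop-carried (because rax doesn't feed back into inc rcx).
I got this from perf: 192380778 cycles # 3,207 GHz and divided it by the number of iterations, 100000000. Maybe I've misunderstood something, but thanks to you I'm making progress ;)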