【Question title】: Assembly: why is "lea eax, [eax + eax*const]; shl eax, eax, const;" combined faster than "imul eax, eax, const" according to gcc -O2?
【Posted】: 2022-01-15 20:45:06
【Question】:

I am using Godbolt to get the assembly of the following program:

#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    res = a * 36;
    return 1;
}

If I use the -Os optimization, the generated code is the natural one:

mov     eax, DWORD PTR a[rip]
imul    eax, eax, 36
mov     DWORD PTR res[rip], eax

But if I use -O2, the generated code is this:

mov     eax, DWORD PTR a[rip]
lea     eax, [rax+rax*8]
sal     eax, 2
mov     DWORD PTR res[rip], eax

So instead of multiplying 5*36, it does 5 -> 5+5*8=45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.
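
The arithmetic above can be checked with a small sketch. It is written in Python rather than C, and the helper names (`mul36_imul`, `mul36_lea_shl`) and the 32-bit mask are illustrative assumptions that model the wrap-around of the 32-bit eax register; they are not part of the compiler output.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register such as eax

def mul36_imul(a):
    # imul eax, eax, 36
    return (a * 36) & MASK32

def mul36_lea_shl(a):
    # lea eax, [rax+rax*8]  -> a * 9
    t = (a + a * 8) & MASK32
    # sal eax, 2            -> (a * 9) * 4
    return (t << 2) & MASK32

# Both sequences agree, including for values that overflow 32 bits.
for a in (0, 1, 5, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    assert mul36_imul(a) == mul36_lea_shl(a)

print(mul36_lea_shl(5))  # 180
```

The two forms agree for every input because 36 = 9 * 4 and both the multiply and the shift discard the same high bits.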

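But the lea instruction needs to calculate rax+rax*8, which contains 1 addition + 1 mul. So why is it still faster than just 1 imul? Is it because memory addressing inside lea is free?

Edit 1: Also, how does [rax + rax*8] get translated into machine code? Does it get compiled down to 2 additional instructions (shl, rbx, rax, 3; add rax, rax, rbx;), or something else?

Edit 2: Surprising results below. I made a loop, generated code using -O2, then copied the file and replaced the segment above with the code from -Os. So the 2 assembly files are the same everywhere, except for the instructions we are benchmarking. Running on Windows, the commands are: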

gcc mul.c -O2 -S -masm=intel -o mulo2.s 
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"

#include <stdio.h>

volatile int a = 5;
volatile int res = 0;

int main() {
    size_t LOOP = 1000 * 1000 * 1000;
    LOOP = LOOP * 10;
    size_t i = 0;
    while (i < LOOP) {
      i++;
      res = a * 36;
    }

    return 0;
}

; mulo2.s
    .file   "mul.c"
    .intel_syntax noprefix
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 4
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    sub rsp, 40
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    movabs  rdx, 10000000000
    .p2align 4,,10
    .p2align 3
.L2:
    mov eax, DWORD PTR a[rip]
    lea eax, [rax+rax*8] ; replaces these 2 lines with
    sal eax, 2           ; imul eax, eax, 36
    mov DWORD PTR res[rip], eax
    sub rdx, 1
    jne .L2
    xor eax, eax
    add rsp, 40
    ret
    .seh_endproc
    .globl  res
    .bss
    .align 4
res:
    .space 4
    .globl  a
    .data
    .align 4
a:
    .long   5
    .ident  "GCC: (GNU) 9.3.0"

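Surprisingly, the result is that the -Os version is consistently faster than -O2 (4.1s vs 5s on average, Intel 8750H CPU, each .exe file run several times). So in this case, the compiler has optimized wrongly. Could someone provide a new explanation given this benchmark?

Edit 3: To measure the effect of instruction cache lines, here is a Python script that generates different addresses for the main loop by adding nop instructions to the program right before the main loop. It is written for Windows; for Linux it only needs slight modification.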

#cd "D:\Learning\temp"
import os

# Read the -O2 assembly once; every padded variant is derived from it.
with open("mulo2.s", "r") as f:
    lines = f.readlines()

def addNop(cnt, outputname):
    # Copy the first 17 lines (everything before the main loop),
    # then cnt nop instructions, then the rest of the file.
    with open(outputname, "w") as f:
        f.writelines(lines[:17])
        f.write("\tnop\n" * cnt)
        f.writelines(lines[17:])

if not os.path.isdir("nop_files"):
    os.mkdir("nop_files")
MAXN = 100
for t in range(MAXN+1):
    sourceFile = "nop_files\\mulo2_" + str(t) + ".s" # change \\ to / on Linux
    exeFile = "nop_files\\mulo2_" + str(t)
    # Assemble each variant only once; later runs reuse the executable.
    if not os.path.isfile(sourceFile):
        addNop(t, sourceFile)
        os.system("gcc " + sourceFile + " -o " + exeFile)
    runtime = os.popen("timecmd " + exeFile).read() # use time on Linux
    print(str(t) + " nop: " + str(runtime))

Result:

0 nop: command took 0:0:4.96 (4.96s total)

1 nop: command took 0:0:4.94 (4.94s total)

2 nop: command took 0:0:4.90 (4.90s total)

3 nop: command took 0:0:4.90 (4.90s total)

4 nop: command took 0:0:5.26 (5.26s total)

5 nop: command took 0:0:4.94 (4.94s total)

6 nop: command took 0:0:4.92 (4.92s total)

7 nop: command took 0:0:4.98 (4.98s total)

8 nop: command took 0:0:5.02 (5.02s total)

9 nop: command took 0:0:4.97 (4.97s total)

10 nop: command took 0:0:5.12 (5.12s total)

11 nop: command took 0:0:5.01 (5.01s total)

12 nop: command took 0:0:5.01 (5.01s total)

13 nop: command took 0:0:5.07 (5.07s total)

14 nop: command took 0:0:5.08 (5.08s total)

15 nop: command took 0:0:5.07 (5.07s total)

16 nop: command took 0:0:5.09 (5.09s total)

17 nop: command took 0:0:7.96 (7.96s total) # slow 17

18 nop: command took 0:0:7.93 (7.93s total)

19 nop: command took 0:0:7.88 (7.88s total)

20 nop: command took 0:0:7.88 (7.88s total)

21 nop: command took 0:0:7.94 (7.94s total)

22 nop: command took 0:0:7.90 (7.90s total)

23 nop: command took 0:0:7.92 (7.92s total)

24 nop: command took 0:0:7.99 (7.99s total)

25 nop: command took 0:0:7.89 (7.89s total)

26 nop: command took 0:0:7.88 (7.88s total)

27 nop: command took 0:0:7.88 (7.88s total)

28 nop: command took 0:0:7.84 (7.84s total)

29 nop: command took 0:0:7.84 (7.84s total)

30 nop: command took 0:0:7.88 (7.88s total)

31 nop: command took 0:0:7.91 (7.91s total)

32 nop: command took 0:0:7.89 (7.89s total)

33 nop: command took 0:0:7.88 (7.88s total)

34 nop: command took 0:0:7.94 (7.94s total)

35 nop: command took 0:0:7.81 (7.81s total)

36 nop: command took 0:0:7.89 (7.89s total)

37 nop: command took 0:0:7.90 (7.90s total)

38 nop: command took 0:0:7.92 (7.92s total)

39 nop: command took 0:0:7.83 (7.83s total)

40 nop: command took 0:0:4.95 (4.95s total) # fast 40

41 nop: command took 0:0:4.91 (4.91s total)

42 nop: command took 0:0:4.97 (4.97s total)

43 nop: command took 0:0:4.97 (4.97s total)

44 nop: command took 0:0:4.97 (4.97s total)

45 nop: command took 0:0:5.11 (5.11s total)

46 nop: command took 0:0:5.13 (5.13s total)

47 nop: command took 0:0:5.01 (5.01s total)

48 nop: command took 0:0:5.01 (5.01s total)

49 nop: command took 0:0:4.97 (4.97s total)

50 nop: command took 0:0:5.03 (5.03s total)

51 nop: command took 0:0:5.32 (5.32s total)

52 nop: command took 0:0:4.95 (4.95s total)

53 nop: command took 0:0:4.97 (4.97s total)

54 nop: command took 0:0:4.94 (4.94s total)

55 nop: command took 0:0:4.99 (4.99s total)

56 nop: command took 0:0:4.99 (4.99s total)

57 nop: command took 0:0:5.04 (5.04s total)

58 nop: command took 0:0:4.97 (4.97s total)

59 nop: command took 0:0:4.97 (4.97s total)

60 nop: command took 0:0:4.95 (4.95s total)

61 nop: command took 0:0:4.99 (4.99s total)

62 nop: command took 0:0:4.94 (4.94s total)

63 nop: command took 0:0:4.94 (4.94s total)

64 nop: command took 0:0:4.92 (4.92s total)

65 nop: command took 0:0:4.91 (4.91s total)

66 nop: command took 0:0:4.98 (4.98s total)

67 nop: command took 0:0:4.93 (4.93s total)

68 nop: command took 0:0:4.95 (4.95s total)

69 nop: command took 0:0:4.92 (4.92s total)

70 nop: command took 0:0:4.93 (4.93s total)

71 nop: command took 0:0:4.97 (4.97s total)

72 nop: command took 0:0:4.93 (4.93s total)

73 nop: command took 0:0:4.94 (4.94s total)

74 nop: command took 0:0:4.96 (4.96s total)

75 nop: command took 0:0:4.91 (4.91s total)

76 nop: command took 0:0:4.92 (4.92s total)

77 nop: command took 0:0:4.91 (4.91s total)

78 nop: command took 0:0:5.03 (5.03s total)

79 nop: command took 0:0:4.96 (4.96s total)

80 nop: command took 0:0:5.20 (5.20s total)

81 nop: command took 0:0:7.93 (7.93s total) # slow 81

82 nop: command took 0:0:7.88 (7.88s total)

83 nop: command took 0:0:7.85 (7.85s total)

84 nop: command took 0:0:7.91 (7.91s total)

85 nop: command took 0:0:7.93 (7.93s total)

86 nop: command took 0:0:8.06 (8.06s total)

87 nop: command took 0:0:8.03 (8.03s total)

88 nop: command took 0:0:7.85 (7.85s total)

89 nop: command took 0:0:7.88 (7.88s total)

90 nop: command took 0:0:7.91 (7.91s total)

91 nop: command took 0:0:7.86 (7.86s total)

92 nop: command took 0:0:7.99 (7.99s total)

93 nop: command took 0:0:7.86 (7.86s total)

94 nop: command took 0:0:7.91 (7.91s total)

95 nop: command took 0:0:8.12 (8.12s total)

96 nop: command took 0:0:7.88 (7.88s total)

97 nop: command took 0:0:7.81 (7.81s total)

98 nop: command took 0:0:7.88 (7.88s total)

99 nop: command took 0:0:7.85 (7.85s total)

100 nop: command took 0:0:7.90 (7.90s total)

101 nop: command took 0:0:7.93 (7.93s total)

102 nop: command took 0:0:7.85 (7.85s total)

103 nop: command took 0:0:7.88 (7.88s total)

104 nop: command took 0:0:5.00 (5.00s total) # fast 104

105 nop: command took 0:0:5.03 (5.03s total)

106 nop: command took 0:0:4.97 (4.97s total)

107 nop: command took 0:0:5.06 (5.06s total)

108 nop: command took 0:0:5.01 (5.01s total)

109 nop: command took 0:0:5.00 (5.00s total)

110 nop: command took 0:0:4.95 (4.95s total)

111 nop: command took 0:0:4.91 (4.91s total)

112 nop: command took 0:0:4.94 (4.94s total)

113 nop: command took 0:0:4.93 (4.93s total)

114 nop: command took 0:0:4.92 (4.92s total)

115 nop: command took 0:0:4.92 (4.92s total)

116 nop: command took 0:0:4.92 (4.92s total)

117 nop: command took 0:0:5.13 (5.13s total)

118 nop: command took 0:0:4.94 (4.94s total)

119 nop: command took 0:0:4.97 (4.97s total)

120 nop: command took 0:0:5.14 (5.14s total)

121 nop: command took 0:0:4.94 (4.94s total)

122 nop: command took 0:0:5.17 (5.17s total)

123 nop: command took 0:0:4.95 (4.95s total)

124 nop: command took 0:0:4.97 (4.97s total)

125 nop: command took 0:0:4.99 (4.99s total)

126 nop: command took 0:0:5.20 (5.20s total)

127 nop: command took 0:0:5.23 (5.23s total)

128 nop: command took 0:0:5.19 (5.19s total)

129 nop: command took 0:0:5.21 (5.21s total)

130 nop: command took 0:0:5.33 (5.33s total)

131 nop: command took 0:0:4.92 (4.92s total)

132 nop: command took 0:0:5.02 (5.02s total)

133 nop: command took 0:0:4.90 (4.90s total)

134 nop: command took 0:0:4.93 (4.93s total)

135 nop: command took 0:0:4.99 (4.99s total)

136 nop: command took 0:0:5.08 (5.08s total)

137 nop: command took 0:0:5.02 (5.02s total)

138 nop: command took 0:0:5.15 (5.15s total)

139 nop: command took 0:0:5.07 (5.07s total)

140 nop: command took 0:0:5.03 (5.03s total)

141 nop: command took 0:0:4.94 (4.94s total)

142 nop: command took 0:0:4.92 (4.92s total)

143 nop: command took 0:0:4.96 (4.96s total)

144 nop: command took 0:0:4.92 (4.92s total)

145 nop: command took 0:0:7.86 (7.86s total) # slow 145

146 nop: command took 0:0:7.87 (7.87s total)

147 nop: command took 0:0:7.83 (7.83s total)

148 nop: command took 0:0:7.83 (7.83s total)

149 nop: command took 0:0:7.84 (7.84s total)

150 nop: command took 0:0:7.87 (7.87s total)

151 nop: command took 0:0:7.84 (7.84s total)

152 nop: command took 0:0:7.88 (7.88s total)

153 nop: command took 0:0:7.87 (7.87s total)

154 nop: command took 0:0:7.83 (7.83s total)

155 nop: command took 0:0:7.85 (7.85s total)

156 nop: command took 0:0:7.91 (7.91s total)

157 nop: command took 0:0:8.18 (8.18s total)

158 nop: command took 0:0:7.94 (7.94s total)

159 nop: command took 0:0:7.92 (7.92s total)

160 nop: command took 0:0:7.92 (7.92s total)

161 nop: command took 0:0:7.97 (7.97s total)

162 nop: command took 0:0:8.12 (8.12s total)

163 nop: command took 0:0:7.89 (7.89s total)

164 nop: command took 0:0:7.92 (7.92s total)

165 nop: command took 0:0:7.88 (7.88s total)

166 nop: command took 0:0:7.80 (7.80s total)

167 nop: command took 0:0:7.82 (7.82s total)

168 nop: command took 0:0:4.97 (4.97s total) # fast

169 nop: command took 0:0:4.97 (4.97s total)

170 nop: command took 0:0:4.95 (4.95s total)

171 nop: command took 0:0:5.00 (5.00s total)

172 nop: command took 0:0:4.95 (4.95s total)

173 nop: command took 0:0:4.93 (4.93s total)

174 nop: command took 0:0:4.91 (4.91s total)

175 nop: command took 0:0:4.92 (4.92s total)

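The points where the program switches from fast to slow (and then from slow to fast) are: 17S-40F-81S-104F-145S-168F. We can see that the distance from slow to fast code is 23 nops, and the distance from fast to slow code is 41 nops. Checking objdump, we can see that the main loop occupies 24 bytes; this means that if we place it at the start of a cache line (address mod 64 == 0), inserting 41 bytes will make the main loop cross the cache-line boundary, causing the slowdown. So in the default code (no nop added), the main loop is already inside a single cache line.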
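
The 23/41 split can be reproduced with a short sketch. It assumes a 64-byte cache line, the 24-byte loop size reported by objdump, and that each nop is a single byte; the function name `crosses_cache_line` is made up for illustration.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes
LOOP_SIZE = 24    # size of the -O2 main loop reported by objdump

def crosses_cache_line(start, size=LOOP_SIZE, line=CACHE_LINE):
    # True when the first and last byte of the loop fall in different lines.
    return start // line != (start + size - 1) // line

# Try every possible start offset of the loop within a cache line.
slow_offsets = [off for off in range(CACHE_LINE) if crosses_cache_line(off)]
fast_offsets = [off for off in range(CACHE_LINE) if not crosses_cache_line(off)]

print(slow_offsets[0], slow_offsets[-1])     # 41 63
print(len(slow_offsets), len(fast_offsets))  # 23 41
```

Under this model the loop straddles two lines for 23 consecutive offsets and fits in one line for the other 41, which matches the 23-nop slow runs and 41-nop fast runs in the measurements.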

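So we know that the -O2 version being slower is not because of instruction address alignment. The only culprit left seemed to be instruction decoding speed, but we have found a new culprit, as in @Jérôme Richard's answer.

Edit 4: Skylake decodes 16 bytes per cycle. However, the sizes of the -Os and -O2 versions are 21 and 24 bytes respectively, so both require 2 cycles to read the main loop. So where does the speed difference come from?

Conclusion: while the compiler is theoretically correct (lea + sal are 2 super cheap instructions, and addressing inside lea is free since it uses a separate hardware circuit), in practice 1 single expensive instruction imul might be faster because of some extremely complex details of the CPU architecture, including instruction decoding speed, micro-operation (uop) count, and CPU ports.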

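【Comments】:

  • Multiplying by 8 is the same as shifting left by three bits.
  • By the way, did you try benchmarking this with billions of calls to main()? (Or, for example, rename main() to f(), just in case.)
  • Rename 'main' to 'f' (an inline function or just a loop) and call f() a billion times in a new main(). Now generate one executable with -Os and another with -O2. It is not very precise, but a simple test is (Linux) time firstone, time secondone.
  • I think a multiplier is far more complex than an adder in circuitry. The factor in lea is one of 1, 2, 4, 8, so I guess it is hardwired. Also, lea does not set the FLAGS register, whereas imul does.
  • [rax + rax*8] is translated to machine code as a "complex memory address", i.e. exactly the way it is written, rather than being split into extra instructions. Related: x64 instruction encoding and the ModRM byte

Tags: c assembly optimization x86-64 cpu-architecture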


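【Answer 1】:

You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle with a latency of 1 cycle. The same applies to the sal instruction (2 per cycle and 1 cycle of latency).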

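This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not counting the load/store instructions, which are the same in both). Moreover, the second version can be better pipelined, since the two instructions can be executed in parallel for two different input data thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions. Overall, only 1 value can be computed per cycle. AFAIK, recent Intel Icelake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast or possibly faster on the chosen use case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.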
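
As a rough illustration of the latency argument, the sketch below adds up the latencies of each dependency chain. The numbers are the Skylake figures quoted in this answer; the table layout and the `chain_latency` helper are assumptions made for the example, and the model ignores everything except the serial dependency between instructions.

```python
# (latency in cycles, instructions per cycle) as quoted in the answer for Skylake
COSTS = {"imul": (3, 1), "lea": (1, 2), "sal": (1, 2)}

def chain_latency(instrs):
    # Each instruction needs the previous result, so latencies add up.
    return sum(COSTS[name][0] for name in instrs)

print(chain_latency(["imul"]))        # 3
print(chain_latency(["lea", "sal"]))  # 2
```

Latency only matters when later work waits on the result. In the benchmark loop each iteration is independent, which is why throughput limits such as the store port dominate instead.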

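Note that lea instructions are very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters), and it only supports some specific constants for the multiplication (the supported factors are 1, 2, 4 and 8, which means lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.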
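
The list of constants follows directly from the four scale factors, as this small sketch shows. It only considers the forms [x*s] and [x + x*s] with the same register as base and index; the function name `lea_multipliers` is made up for illustration.

```python
SCALES = (1, 2, 4, 8)  # scale factors an x86 addressing mode can encode

def lea_multipliers():
    factors = set()
    for s in SCALES:
        factors.add(s)      # lea r, [x*s]      -> x * s
        factors.add(s + 1)  # lea r, [x + x*s]  -> x * (s + 1)
    factors.discard(1)      # multiplying by 1 is just a copy
    return sorted(factors)

print(lea_multipliers())  # [2, 3, 4, 5, 8, 9]
```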


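UPDATE (v2):

I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).

The main source of the slowdown is the higher number of micro-operations (uops) to execute in the -O2 version, most likely combined with the saturation of some execution ports, most likely due to bad micro-operation scheduling.

Here is the assembly of the loop with -Os: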

    1049:   8b 15 d9 2f 00 00       mov    edx,DWORD PTR [rip+0x2fd9]        # 4028 <a>
    104f:   6b d2 24                imul   edx,edx,0x24
    1052:   89 15 d8 2f 00 00       mov    DWORD PTR [rip+0x2fd8],edx        # 4030 <res>
    1058:   48 ff c8                dec    rax
    105b:   75 ec                   jne    1049 <main+0x9>

Here is the assembly of the loop with -O2:

    1050:   8b 05 d2 2f 00 00       mov    eax,DWORD PTR [rip+0x2fd2]        # 4028 <a>
    1056:   8d 04 c0                lea    eax,[rax+rax*8]
    1059:   c1 e0 02                shl    eax,0x2
    105c:   89 05 ce 2f 00 00       mov    DWORD PTR [rip+0x2fce],eax        # 4030 <res>
    1062:   48 83 ea 01             sub    rdx,0x1
    1066:   75 e8                   jne    1050 <main+0x10>

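Modern x86-64 processors decode (variable-sized) instructions and then translate them into (simpler, fixed-sized) micro-operations, which are finally executed (often in parallel) on several execution ports. More information about the specific Skylake architecture can be found here. Skylake can macro-fuse multiple instructions into a single micro-operation. In this case, the dec+jne and sub+jne instructions are fused into one uop in each version. This means that the -Os version executes 4 uops/iteration while the -O2 version executes 5 uops/iteration.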
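
The uop counts can be reproduced with a deliberately simplified model. It assumes every instruction in these loops counts as one uop, as this answer does, and that dec or sub immediately followed by jne macro-fuses into a single uop; the `count_uops` helper is hypothetical and is not a general x86 model.

```python
def count_uops(instructions):
    # Assumption: one uop per instruction, except that dec/sub
    # immediately followed by jne macro-fuses into a single uop.
    uops, i = 0, 0
    while i < len(instructions):
        fused = (instructions[i] in ("dec", "sub")
                 and i + 1 < len(instructions)
                 and instructions[i + 1] == "jne")
        i += 2 if fused else 1
        uops += 1
    return uops

loop_os = ["mov", "imul", "mov", "dec", "jne"]        # -Os loop body
loop_o2 = ["mov", "lea", "shl", "mov", "sub", "jne"]  # -O2 loop body
print(count_uops(loop_os), count_uops(loop_o2))  # 4 5
```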

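The uops are stored in a uop cache called the Decoded Stream Buffer (DSB), so that the processor does not need to decode/translate the instructions of the (small) loop again. The cached uops to be executed are sent to a queue called the Instruction Decode Queue (IDQ). Up to 6 uops/cycle can be sent from the DSB to the IDQ. For the -Os version, only 4 uops are sent from the DSB to the IDQ every cycle (probably because the loop is bounded by the saturated store port). For the -O2 version, only 5 uops are sent from the DSB to the IDQ every cycle, but only 4 times out of 5 (on average)! This means that 1 cycle of latency is added every 4 cycles, resulting in a 25% slower execution. The cause of this effect is unclear and seems to be related to uop scheduling.

Uops are then sent to the Resource Allocation Table (RAT) and issued to the Reservation Station (RS). The RS dispatches the uops to the ports that execute them. Then the uops are retired (i.e. committed). The number of uops indirectly transmitted from the DSB to the RS is constant for both versions. The same number of uops is retired. However, in both versions 1 more ghost uop is dispatched by the RS every cycle (and executed by the ports). This is probably a uop used to compute the store address (since the store port does not have its own dedicated AGU).

Here are the statistics per iteration, gathered from hardware counters (using perf):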

version | instruction | issued-uops | executed-uops | retired-uops | cycles
"-Os"   |      5      |      4      |        5      |       4      |  1.00
"-O2"   |      6      |      5      |        6      |       5      |  1.25

Here are the statistics of the overall port utilization:

 port  |   type      |  "-Os"  |   "-O2"
-----------------------------------------
    0  | ALU/BR      |     0%  |    60%
    1  | ALU/MUL/LEA |   100%  |    38%
    2  | LOAD/AGU    |    65%  |    60%
    3  | LOAD/AGU    |    73%  |    60%
    4  | STORE       |   100%  |    80%
    5  | ALU/LEA     |     0%  |    42%
    6  | ALU/BR      |   100%  |   100%
    7  | AGU         |    62%  |    40%
-----------------------------------------
 total |             |   500%  |   480%

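Port 6 is the only port that is fully saturated in the -O2 version, which is unexpected, and this most likely explains why an additional cycle is needed every 5 cycles. Note that only the uops associated with the instructions shl and sub+jne use ports 0 and 6 (simultaneously), and no other ports.

Note that the total of 480% is a scheduling artifact due to the stalling cycle. Indeed, 6*4=24 uops should be executed every 5 cycles (24/5*100=480). Note also that the store port is not needed in 1 cycle out of 5 (4 iterations are executed every 5 cycles on average, and therefore 4 store uops), hence its 80% utilization.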
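
The two tables can be cross-checked against each other with a few lines. The utilization figures are copied from the port table and the uop and cycle counts from the perf table; the `expected_total` helper is an illustrative name, not a perf feature.

```python
# Per-port utilization (%) copied from the table above, ports 0..7.
ports_os = [0, 100, 65, 73, 100, 0, 100, 62]
ports_o2 = [60, 38, 60, 60, 80, 42, 100, 40]

def expected_total(executed_uops_per_iter, cycles_per_iter):
    # Summed utilization equals executed uops per cycle, as a percentage.
    return 100 * executed_uops_per_iter / cycles_per_iter

print(sum(ports_os), expected_total(5, 1.00))  # 500 500.0
print(sum(ports_o2), expected_total(6, 1.25))  # 480 480.0
```

The per-port percentages add up to exactly the executed-uops-per-cycle rate in both versions, so the 480% total is what 6 executed uops per 1.25 cycles implies.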



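【Comments】:

  • OK, while the generated code is not exactly equivalent, I can reproduce the problem. I clarified the part about the store instruction, pointing out that the execution is bounded by stores, so you should not see a significant performance difference with -O2. That being said, I did not expect it to be slower. I think this is due to the decoding of the instructions. So the answer gets a bit more complex because of this ;).
  • Wow, this is really in-depth. I rarely pay attention to the instruction cache and never cared about instruction decoding throughput.
  • So there is one more thing to do: can you try adding some instructions to the -O2 version so that the main loop is contained in a single cache line? Then benchmark it again. Also, what software do you use to view the addresses of the instructions?
  • I have just added a script that generates all possible alignments of the instruction addresses. It shows that in the default case the main loop is inside a single cache line, unlike what you commented. Could you update the answer for future readers? Anyway, I guess the only possible answer left is the CPU instruction decoding speed.
  • Note that x86 addressing modes encode the scale factor as a 2-bit shift count. So it is not just a "hard-wired multiply", it is an assemble-time conversion to a shift count, which is of course cheap. (A barrel shifter that only needs to support 4 different shift counts is even simpler than the full barrel shifter needed to efficiently support instructions like shl.) So it is very significant that the allowed scale factors are powers of 2. (And yes, with [same + same*scale] you can get 2^n+1 scaling if you are not adding another register.)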
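
To make the encoding concrete, here is a minimal sketch that decodes the three bytes 8d 04 c0 shown by objdump for lea eax,[rax+rax*8] in the -O2 listing. It handles only this one form (opcode 0x8D, a ModRM byte with mod=00 and rm=100, then a SIB byte, no prefixes); the function name `decode_lea_sib` and the returned dictionary layout are illustrative assumptions, not a real disassembler API.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_lea_sib(code):
    # Handles only: opcode 0x8D, ModRM with mod=00 and rm=100, then SIB.
    opcode, modrm, sib = code
    assert opcode == 0x8D                  # LEA r32, m
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0 and rm == 4            # rm=100 means a SIB byte follows
    shift, index, base = sib >> 6, (sib >> 3) & 7, sib & 7
    assert index != 4 and base != 5        # special cases are not handled here
    return {"dest": REG32[reg], "base": REG64[base],
            "index": REG64[index], "scale": 1 << shift}

print(decode_lea_sib(bytes.fromhex("8d04c0")))
# {'dest': 'eax', 'base': 'rax', 'index': 'rax', 'scale': 8}
```

The top two bits of the SIB byte hold the shift count 3, so the factor 8 never appears in the machine code as a multiplier.
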
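【Answer 2】:

tl;dr: Because LEA does not do full-fledged multiplication.

While @JeromeRichard's answer is correct, the kernel of truth is hidden in its last sentence: with LEA, you can only multiply by a specific constant, which is a power of 2. So it does not need a large dedicated circuit for multiplication, only a small sub-circuit that shifts one of its operands by a fixed amount.

【Comments】:

  • Can you benchmark the code I provided in edit 2? It shows that the -Os version actually runs faster.
  • @HuyLe: I think you need to split your second edit into its own question, since you are asking something else. Link the new question to this one. Also, please provide complete examples, i.e. the two assembly programs or the two C programs; it is hard to understand exactly what you ran.
  • But the second edit contains the same instructions. I am just benchmarking them 10^10 times instead of 1 time?
  • @HuyLe: It is a different question. One question is about two assembly operations in general, even if the motivation is a given program; the other is about the runtime of a specific program. And again, I would need a proper MRE.
  • The assembly code is from -O2. You can replace the "lea eax ..." lines with "imul eax ..." to get the -Os code. Basically the program is the same everywhere except for those 2 lines. Use "gcc mul.s -o mul" to get a runnable program.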