Performance comparison of modulo operator and bitwise AND: answers

【Question title】: Performance comparison of modulo operator and bitwise AND
【Posted】: 2018-07-08 23:32:34
【Question】:

I am determining whether a 32-bit integer is even or odd. I set up two approaches:

Modulo (%) approach

int r = (i % 2);

Bitwise (&) approach

int r = (i & 0x1);

Both approaches work, so I ran each line 15000 times to test performance.
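One caveat worth making explicit: the two expressions only produce the same value for non-negative inputs. In Java, `%` takes the sign of the dividend, so `-3 % 2` is `-1` while `-3 & 1` is `1`. Below is a minimal sketch of parity checks that stay correct for negative values; the class and method names are hypothetical and do not come from the question's source code.

```java
// Hypothetical helper, not part of the original benchmark code.
final class ParityCheck {
    // Correct for all ints: the remainder is -1, 0 or 1, so compare against 0.
    static boolean isOddByMod(int i) {
        return i % 2 != 0;
    }

    // Correct for all ints: the lowest bit of a two's-complement int
    // is set exactly for odd values.
    static boolean isOddByAnd(int i) {
        return (i & 1) != 0;
    }

    // Common pitfall: misses negative odd numbers because -3 % 2 == -1.
    static boolean isOddNaive(int i) {
        return i % 2 == 1;
    }

    public static void main(String[] args) {
        for (int i : new int[] {-3, -2, 0, 7}) {
            System.out.println(i + ": mod=" + isOddByMod(i)
                    + " and=" + isOddByAnd(i) + " naive=" + isOddNaive(i));
        }
    }
}
```

If only the odd/even answer is needed, either of the first two checks will do; the raw values of `i % 2` and `i & 1` are interchangeable only when `i` is known to be non-negative.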

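Results:

Modulo (%) approach (source code)

AVG 141.5801887ns | SD 270.0700275ns

Bitwise (&) approach (source code)

AVG 141.2504ns | SD 193.6351007ns

Questions:

Why is bitwise (&) more stable than modulo (%)?

Does the JVM optimize modulo (%) using AND (&), as described here?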

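【Comments on the question】:

Your two benchmarks look nearly identical to me, with only a slight difference in standard deviation. However, you probably haven't even set up the two tests in a statistically meaningful way.
The first link should be github.com/kflau/modulo-division-vs-bitwise-benchmark/blob/….
Related: How do I write a correct micro-benchmark in Java?
You are doing it completely wrong. You can't use nanoTime() to measure the execution of a single operation, because it usually has a granularity of hundreds of nanoseconds or even more (depending on your OS). Use JMH and inspect the assembly code.

Tags: java performance bit-manipulation jmh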

【Solution 1】:

Let's try to reproduce this with JMH.

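import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MyBenchmark {
    // Non-final instance field, so the JIT cannot constant-fold the expressions.
    // The initial value is an assumption; the original post does not show it.
    int i = 7;

    @Benchmark
    @Measurement(timeUnit = TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int first() {
        return i % 2;   // remainder: must preserve the sign of i
    }

    @Benchmark
    @Measurement(timeUnit = TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int second() {
        return i & 0x1; // plain bitwise AND
    }
}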

OK, it is reproducible: first is slightly slower than second. Now let's find out why. Running it with -prof perfnorm:

Benchmark                                 Mode  Cnt   Score    Error  Units
MyBenchmark.first                         avgt   50   2.674 ±  0.028  ns/op
MyBenchmark.first:CPI                     avgt   10   0.301 ±  0.002   #/op
MyBenchmark.first:L1-dcache-load-misses   avgt   10   0.001 ±  0.001   #/op
MyBenchmark.first:L1-dcache-loads         avgt   10  11.011 ±  0.146   #/op
MyBenchmark.first:L1-dcache-stores        avgt   10   3.011 ±  0.034   #/op
MyBenchmark.first:L1-icache-load-misses   avgt   10  ≈ 10⁻³            #/op
MyBenchmark.first:LLC-load-misses         avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-loads               avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-store-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:LLC-stores              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branch-misses           avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branches                avgt   10   4.006 ±  0.054   #/op
MyBenchmark.first:cycles                  avgt   10   9.322 ±  0.113   #/op
MyBenchmark.first:dTLB-load-misses        avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:dTLB-loads              avgt   10  10.939 ±  0.175   #/op
MyBenchmark.first:dTLB-store-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:dTLB-stores             avgt   10   2.991 ±  0.045   #/op
MyBenchmark.first:iTLB-load-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:iTLB-loads              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:instructions            avgt   10  30.991 ±  0.427   #/op
MyBenchmark.second                        avgt   50   2.263 ±  0.015  ns/op
MyBenchmark.second:CPI                    avgt   10   0.320 ±  0.001   #/op
MyBenchmark.second:L1-dcache-load-misses  avgt   10   0.001 ±  0.001   #/op
MyBenchmark.second:L1-dcache-loads        avgt   10  11.045 ±  0.152   #/op
MyBenchmark.second:L1-dcache-stores       avgt   10   3.014 ±  0.032   #/op
MyBenchmark.second:L1-icache-load-misses  avgt   10  ≈ 10⁻³            #/op
MyBenchmark.second:LLC-load-misses        avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-loads              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-store-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:LLC-stores             avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branch-misses          avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branches               avgt   10   4.014 ±  0.045   #/op
MyBenchmark.second:cycles                 avgt   10   8.024 ±  0.098   #/op
MyBenchmark.second:dTLB-load-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:dTLB-loads             avgt   10  10.989 ±  0.161   #/op
MyBenchmark.second:dTLB-store-misses      avgt   10  ≈ 10⁻⁶            #/op
MyBenchmark.second:dTLB-stores            avgt   10   3.004 ±  0.042   #/op
MyBenchmark.second:iTLB-load-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:iTLB-loads             avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:instructions           avgt   10  25.076 ±  0.296   #/op

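Note the difference in cycles and instructions (about 9.3 vs 8.0 cycles and 31 vs 25 instructions per operation). Now the reason is fairly obvious: first has to take the sign into account, but second does not (it is just a bitwise AND). To confirm that this is the reason, look at the assembly fragments:

first: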

0x00007f91111f8355: mov     0xc(%r10),%r11d   ;*getfield i
0x00007f91111f8359: mov     %r11d,%edx
0x00007f91111f835c: and     $0x1,%edx
0x00007f91111f835f: mov     %edx,%r10d
0x00007f6bd120a6e2: neg     %r10d
0x00007f6bd120a6e5: test    %r11d,%r11d
0x00007f6bd120a6e8: cmovl   %r10d,%edx

second:

0x00007ff36cbda580: mov     $0x1,%edx
0x00007ff36cbda585: mov     0x40(%rsp),%r10
0x00007ff36cbda58a: and     0xc(%r10),%edx
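Read back as Java, the fragment for first appears to compute `i & 1` and then select the negated value when `i` is negative (the and / neg / test / cmovl sequence). The sketch below spells that out and checks it against `i % 2` for a handful of values. The names are hypothetical, and it illustrates the transformation rather than reproducing the JIT's actual code.

```java
// Hypothetical illustration of the instruction sequence shown for first().
final class SignedRemainderSketch {
    static int remainderByTwo(int i) {
        int low = i & 1;               // and   $0x1,%edx
        int negated = -low;            // neg   %r10d
        return i < 0 ? negated : low;  // test + cmovl: pick the negated value if i < 0
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -3, -2, -1, 0, 1, 2, 3, Integer.MAX_VALUE};
        for (int i : samples) {
            if (remainderByTwo(i) != i % 2) {
                throw new AssertionError("mismatch for " + i);
            }
        }
        System.out.println("remainderByTwo matches i % 2 for all samples");
    }
}
```

The extra negate, test and conditional move are consistent with the handful of additional instructions per operation that first shows in the perfnorm output.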

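【Comments】:

I am missing an explanation here, something like "modulo is a very slow operation, but the JVM optimizes i % 2 to i > 0 ? i & 1 : -(i & 1)".
@maaartinus That is true when the divisor is a power of 2. Otherwise it compiles to some heavier bit manipulation (surprisingly, not a simple idiv as I expected).

【Solution 2】: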

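An execution time of 150 ns is roughly 500 clock cycles. I don't think there has ever been a processor that checked a single bit this inefficiently :-).

The problem is that your test harness is flawed in many ways. In particular:

it makes no attempt to trigger JIT compilation before starting the clock
System.nanoTime() is not guaranteed to have nanosecond accuracy
calling System.nanoTime() is far more expensive than the code being measured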
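The last two points can be checked on a given machine with a small probe. This is a rough sketch (NanoTimeProbe is a hypothetical name, not part of the original answer): it calls System.nanoTime() back to back and reports the smallest non-zero step it observes, which gives an idea of the timer's resolution plus call overhead on that system.

```java
// Hypothetical probe for the effective resolution of System.nanoTime().
final class NanoTimeProbe {
    // Returns the smallest non-zero difference seen between two consecutive
    // System.nanoTime() calls, or Long.MAX_VALUE if no non-zero step was seen.
    static long smallestObservedStep(int attempts) {
        long smallest = Long.MAX_VALUE;
        for (int n = 0; n < attempts; n++) {
            long first = System.nanoTime();
            long second = System.nanoTime();
            long delta = second - first;
            if (delta > 0 && delta < smallest) {
                smallest = delta;
            }
        }
        return smallest;
    }

    public static void main(String[] args) {
        System.out.println("smallest observed step: "
                + smallestObservedStep(1_000_000) + " ns");
    }
}
```

If the reported step is tens of nanoseconds or more, timing a single `%` or `&` between two nanoTime() calls mostly measures the timer itself.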

See How do I write a correct micro-benchmark in Java? for a more complete list of things to keep in mind.

Here is a better benchmark:

import java.math.BigDecimal;

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns / iteration";
    }

    private BigDecimal time() {
        try {
            // automatically detect a reasonable iteration count (and trigger just in time compilation of the code under test)
            int iterations;
            long duration = 0;
            for (iterations = 1; iterations < 1_000_000_000 && duration < 1_000_000_000; iterations *= 2) {
                long start = System.nanoTime();
                run(iterations);
                duration = System.nanoTime() - start;
                cleanup();
            }
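            // The loop increment has already doubled `iterations` once more after the
            // last timed run, so that run executed iterations / 2 iterations.
            return new BigDecimal(duration * 1000 / (iterations / 2)).movePointLeft(3);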
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Executes the code under test.
     * @param iterations
     *            number of iterations to perform
     * @return any value that requires the entire code to be executed (to
     *         prevent dead code elimination by the just in time compiler)
     * @throws Throwable
     *             if the test could not complete successfully
     */
    protected abstract Object run(int iterations) throws Throwable;

    /**
     * Cleans up after a run, setting the stage for the next.
     */
    protected void cleanup() {
        // do nothing
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new Benchmark("%") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i % 2;
                }
                return sum; 
            }
        });
        System.out.println(new Benchmark("&") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i & 1;
                }
                return sum;
            }
        });
    }
}

On my machine, it prints:

%   0.375 ns / iteration
&   0.139 ns / iteration

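So, as expected, the difference is on the order of a couple of clock cycles. That is, & 1 was optimized slightly better by this JIT on this particular hardware, but the difference is so small that it is extremely unlikely to have a measurable (let alone significant) impact on the performance of your program.

【Comments】:

【Solution 3】:

The two operations correspond to different JVM bytecode instructions: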

irem     // int remainder (%)
iand     // bitwise and (&)

I read somewhere that irem is usually implemented by the JVM, whereas iand is available in hardware. Oracle explains the two instructions as follows:

iand

An int result is calculated by taking the bitwise AND (conjunction) of value1 and value2.

irem

The int result is value1 - (value1 / value2) * value2.

It seems reasonable to me to assume that iand takes fewer CPU cycles.
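The irem definition quoted above can be checked directly in Java, where int division truncates toward zero. The following is a small sketch with hypothetical names; it shows that the formula agrees with `%` for both positive and negative operands, which is also why the result carries the sign of value1.

```java
// Hypothetical check of the quoted irem definition against the % operator.
final class IremIdentity {
    // value1 - (value1 / value2) * value2, using truncating int division.
    static int remainderByDefinition(int value1, int value2) {
        return value1 - (value1 / value2) * value2;
    }

    public static void main(String[] args) {
        int[][] pairs = {{7, 2}, {-7, 2}, {7, -2}, {-7, -2}, {6, 3}};
        for (int[] p : pairs) {
            System.out.println(p[0] + " % " + p[1] + " = " + (p[0] % p[1])
                    + ", by definition = " + remainderByDefinition(p[0], p[1]));
        }
    }
}
```

Note that this describes the semantics of the bytecode only; it says nothing about which machine instructions the JIT eventually emits.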

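【Comments】:

Most CPUs can also compute the remainder, but it is slow, and the JVM optimizes it to AND if possible.
@maaartinus When does that optimization happen?
Actually, all optimizations happen at runtime. javac (or ecj or whatever) does not care; the JVM does. Many optimizations cause code bloat, so it is best to concentrate on the hot spots (hence the name of their compiler). Moreover, many optimizations are based on assumptions that may later be invalidated, in which case the JVM must deoptimize. That is why you see nothing like this in the bytecode.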