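Parallel stream processing vs thread pool processing vs sequential processing: answers

【Question title】: Parallel stream processing vs Thread pool processing vs Sequential processing
【Posted】: 2018-05-04 05:12:23
【Question】:

I am just evaluating which of these code snippets performs better in Java 8.

Snippet 1 (processing in the main thread):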

public long doSequence() {
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    ds.forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    });
    System.out.println("Sequence");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 2 (processing in parallel threads):

public long doParallel() {
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    ds.parallel().forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    });
    System.out.println("Parallel");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 3 (processing in parallel threads inside a thread pool):

public long doThreadPoolParallel() throws InterruptedException, ExecutionException {
    ForkJoinPool customThreadPool = new ForkJoinPool(4);
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    customThreadPool.submit(() -> ds.parallel().forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    })).get();
    System.out.println("Thread Pool");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Here is the output:

Parallel
6553089257123798384
34 <--34 milli seconds

Thread Pool
6553089257123798384
23 <--23 milli seconds

Sequence
6553089257123798384
12 <--12 milli seconds!

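My expectations were:

1) Processing with the thread pool should take the least time, but that is not the case. (Note that I did not include the thread pool creation time in the measurement, so it should be fast.)

2) I did not expect the sequentially executed code to be the fastest. What could be the reason?

I am using a quad-core processor.

I would appreciate any help explaining the discrepancy above!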

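【Comments】:

Have you read about writing correct micro-benchmarks in Java? stackoverflow.com/questions/504103/… -- How are you actually invoking your benchmark? Are you warming up the JVM?
@ErwinBolwidt I did not follow all the points mentioned in the micro-benchmarking question, but I did warm up the JVM before printing the numbers above. Sequential processing is always faster than its counterparts, which is puzzling!
Your threads are probably spending too much time on the atomic operations.
First, you should use System.nanoTime() to measure elapsed time. Further, if you claim to test stream processing, you should do stream processing rather than loop code in disguise, i.e. IntStream.range(0, 100000).parallel().map(num -> (long)Math.pow(num, 3) + (long)Math.pow(num, 2)).sum(). Then try a larger range to see how it scales; that makes it possible to identify the fixed-overhead part. By the way, note how much simpler the conversion to long becomes when it is not obfuscated via new Double(…).longValue()...
map → mapToLong...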
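
A minimal sketch of the reduction-style pipeline suggested in the comments above, timed with System.nanoTime(). This is an illustration rather than the commenter's exact code: the names StreamSum and sum are hypothetical, and plain long multiplication stands in for Math.pow so that the arithmetic stays in integers. The total wraps past Long.MAX_VALUE just as it does in the question's snippets, and the timings come from single runs without warm-up, so they are only indicative.

```java
import java.util.stream.LongStream;

final class StreamSum {
    // Sums n^3 + n^2 for n in [0, upperExclusive) as a reduction, with no shared mutable state.
    static long sum(long upperExclusive, boolean parallel) {
        LongStream numbers = LongStream.range(0, upperExclusive);
        if (parallel) {
            numbers = numbers.parallel();
        }
        return numbers.map(n -> n * n * n + n * n).sum();
    }

    public static void main(String[] args) {
        // Two range sizes, to see how the cost scales beyond the fixed overhead.
        for (long size : new long[] {100_000L, 1_000_000L}) {
            for (boolean parallel : new boolean[] {false, true}) {
                long start = System.nanoTime();
                long result = sum(size, parallel);
                long elapsedMicros = (System.nanoTime() - start) / 1_000;
                System.out.println((parallel ? "parallel   " : "sequential ")
                        + size + " elements: " + result + " in " + elapsedMicros + " us");
            }
        }
    }
}
```

Because each element contributes to a reduction instead of updating a shared counter, the parallel variant has no per-element coordination between threads.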

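Tags: java multithreading java-8 java-stream

【Solution 1】:

Your comparison is not sound, most likely because of the missing JVM warm-up. When I simply repeat the executions, I get different results: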

System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());

Results:

Parallel
6553089257123798384
65
Thread Pool
6553089257123798384
13
Sequence
6553089257123798384
14
-------
Parallel
6553089257123798384
9
Thread Pool
6553089257123798384
4
Sequence
6553089257123798384
8
-------
Parallel
6553089257123798384
8
Thread Pool
6553089257123798384
3
Sequence
6553089257123798384
8

As @Erwin pointed out in the comments, please check the answers to the question he linked (rule 1 in this case) to see how to perform this benchmark properly.
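
The warm-up idea can be sketched as follows. This is a rough illustration, not a substitute for a dedicated benchmark harness such as JMH: the class WarmupTiming, the helper bestNanos, and the warm-up and run counts are all hypothetical choices made for this example.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

final class WarmupTiming {
    // Runs the task `warmups` times untimed, then returns the fastest of `runs` timed executions in ns.
    static long bestNanos(LongSupplier task, int warmups, int runs) {
        long sink = 0; // accumulate results so the work is visibly used
        for (int i = 0; i < warmups; i++) {
            sink += task.getAsLong();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink += task.getAsLong();
            best = Math.min(best, System.nanoTime() - start);
        }
        System.out.println("checksum: " + sink);
        return best;
    }

    public static void main(String[] args) {
        LongSupplier sequential =
                () -> LongStream.range(0, 100_000).map(n -> n * n * n + n * n).sum();
        LongSupplier parallel =
                () -> LongStream.range(0, 100_000).parallel().map(n -> n * n * n + n * n).sum();
        // Warm-up and run counts are arbitrary values picked for illustration.
        System.out.println("sequential best ns: " + bestNanos(sequential, 20, 10));
        System.out.println("parallel best ns:   " + bestNanos(parallel, 20, 10));
    }
}
```

Taking the best of several timed runs after the untimed ones reduces the influence of class loading and JIT compilation on the reported number.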

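The default parallelism of parallel streams is not necessarily the same as that of a fork-join pool with as many threads as the machine has cores, although the difference between the results was still negligible when I switched from your custom pool to the common fork-join pool.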
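
To see the parallelism levels involved on a given machine, something like the following can be used. It is a sketch under assumptions: PoolParallelism and sumInPool are hypothetical names, long multiplication replaces Math.pow, and running a parallel stream inside a custom ForkJoinPool (as snippet 3 does) relies on behavior of the stream implementation rather than on a documented guarantee.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

final class PoolParallelism {
    // Starts the parallel stream from inside a task of the given pool, as in snippet 3.
    static long sumInPool(ForkJoinPool pool, long upperExclusive)
            throws InterruptedException, ExecutionException {
        return pool.submit(
                () -> LongStream.range(0, upperExclusive).parallel().map(n -> n * n * n + n * n).sum()
        ).get();
    }

    public static void main(String[] args) throws Exception {
        ForkJoinPool custom = new ForkJoinPool(4);
        try {
            System.out.println("available processors:    " + Runtime.getRuntime().availableProcessors());
            System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
            System.out.println("custom pool parallelism: " + custom.getParallelism());
            System.out.println("sum in custom pool:      " + sumInPool(custom, 100_000));
        } finally {
            custom.shutdown();
        }
    }
}
```

On many machines the common pool reports a parallelism lower than the processor count, which is one way the two setups can differ.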

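【Discussion】:

Well, this is exactly what I was looking for: the thread pool should perform better, and your results reflect that. I will run my snippets on a quiet machine and benchmark them according to stackoverflow.com/questions/504103/… That is probably what made the difference.

【Solution 2】: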

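AtomicLong.addAndGet requires synchronization between threads - every thread must see the result of the previous addAndGet - which is why you can count on the total being correct.

Although this is not the traditional synchronized kind of synchronization, it still has overhead. In JDK 7, addAndGet was implemented in Java code as a spin loop that retried a compare-and-set until it succeeded. In JDK 8, it became an intrinsic, which HotSpot implements on Intel platforms by emitting a LOCK XADD instruction.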
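
The older, pure-Java approach can be pictured as a compare-and-set retry loop. The sketch below is illustrative only and is not copied from the JDK sources; CasLoop and its addAndGet helper are hypothetical names built on the public AtomicLong API.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasLoop {
    // Adds delta by retrying compareAndSet until no other thread changed the value in between.
    static long addAndGet(AtomicLong value, long delta) {
        for (;;) {
            long current = value.get();
            long next = current + delta;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread won the race; loop and retry with the fresh value.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    addAndGet(total, 1);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(total.get()); // 4 threads x 100000 increments = 400000
    }
}
```

Under contention, each failed compareAndSet is wasted work, which is the kind of overhead that grows when many threads update the same counter.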

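It requires cache synchronization between CPUs, which has a cost. It may even require flushing to and reading from main memory, which is very slow compared with code that does not need to do so.

Quite possibly, because this synchronization overhead occurs on every iteration of your test, it outweighs any performance gain from parallelization.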
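
One way to probe this explanation is to compare a single shared AtomicLong with java.util.concurrent.atomic.LongAdder, which is designed for frequent concurrent updates. The sketch below is an assumption-laden illustration: SharedCounterCost and its methods are hypothetical names, long multiplication replaces Math.pow, and the single timed runs are not a proper benchmark.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.LongStream;

final class SharedCounterCost {
    // Every element updates one shared AtomicLong, as in the question's snippets.
    static long withAtomicLong(long upperExclusive) {
        AtomicLong total = new AtomicLong();
        LongStream.range(0, upperExclusive).parallel()
                .forEach(n -> total.addAndGet(n * n * n + n * n));
        return total.get();
    }

    // LongAdder spreads updates over internal cells and combines them only in sum().
    static long withLongAdder(long upperExclusive) {
        LongAdder total = new LongAdder();
        LongStream.range(0, upperExclusive).parallel()
                .forEach(n -> total.add(n * n * n + n * n));
        return total.sum();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long atomicResult = withAtomicLong(100_000);
        long atomicNanos = System.nanoTime() - start;

        start = System.nanoTime();
        long adderResult = withLongAdder(100_000);
        long adderNanos = System.nanoTime() - start;

        System.out.println("AtomicLong: " + atomicResult + " in " + atomicNanos + " ns");
        System.out.println("LongAdder:  " + adderResult + " in " + adderNanos + " ns");
    }
}
```

Both variants produce the same total; any difference in elapsed time hints at how much of the cost comes from contention on the shared counter.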

【Discussion】: