Java Streams: How to do an efficient "distinct and sort"?

【Question title】: Java Streams: How to do an efficient "distinct and sort"?
【Posted】: 2017-05-05 13:37:26
【Question】:

Let's say I have a Stream<T> and want to get only the distinct elements, sorted.

The naive approach would be to do just the following:

Stream.of(...)
    .sorted()
    .distinct()

Or, maybe, the other way around:

Stream.of(...)
    .distinct()
    .sorted()

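Since the implementation of both is not really accessible from the JDK's source code, I was wondering about the possible memory consumption and performance implications.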

Or would it be even more efficient to write my own filter, like the following?

Stream.of(...)
    .sorted()
    .filter(noAdjacentDuplicatesFilter())

public static Predicate<Object> noAdjacentDuplicatesFilter() {
    final Object[] previousValue = {new Object()};

    return value -> {
        final boolean takeValue = !Objects.equals(previousValue[0], value);
        previousValue[0] = value;
        return takeValue;
    };
}

【Comments】:

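In the best case, the underlying implementation recognizes when distinct() and sorted() follow each other and fuses them into one operation. Keep in mind that Streams are lazy: nothing happens until you chain a terminal operation, and at that point the stream knows everything you have chained.
@Holger I understand; I would be interested in whether this really happens and whether this behaviour is guaranteed.
Well, I guess it also depends on the nature of the data: few distinct values that occur many times, or many distinct values with only a few duplicates... In the second case sorted() then distinct() is better; in the first case distinct() then sorted() may be faster, especially for scattered data. My two cents.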

Tags: java performance java-8 java-stream

【Solution 1】:

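When you chain a distinct() operation after sorted(), the implementation will utilize the sorted nature of the data and avoid building an internal HashSet, which can be demonstrated by the following program: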

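import java.util.stream.Stream;

public class DistinctAndSort {
    // Counters for how often the stream pipeline calls each method
    static int COMPARE, EQUALS, HASHCODE;
    static class Tracker implements Comparable<Tracker> {
        static int SERIAL;
        int id;
        Tracker() {
            // integer division: every id is shared by two consecutive instances
            id=SERIAL++/2;
        }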
        public int compareTo(Tracker o) {
            COMPARE++;
            return Integer.compare(id, o.id);
        }
        public int hashCode() {
            HASHCODE++;
            return id;
        }
        public boolean equals(Object obj) {
            EQUALS++;
            return super.equals(obj);
        }
    }
    public static void main(String[] args) {
        System.out.println("adjacent sorted() and distinct()");
        Stream.generate(Tracker::new).limit(100)
              .sorted().distinct()
              .forEachOrdered(o -> {});
        System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                          COMPARE, EQUALS, HASHCODE);
        COMPARE=EQUALS=HASHCODE=0;
        System.out.println("now with intermediate operation");
        Stream.generate(Tracker::new).limit(100)
            .sorted().map(x -> x).distinct()
            .forEachOrdered(o -> {});
        System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                          COMPARE, EQUALS, HASHCODE);
    }
}

which will print:

adjacent sorted() and distinct()
compareTo: 99, EQUALS: 99, HASHCODE: 0
now with intermediate operation
compareTo: 99, EQUALS: 100, HASHCODE: 200

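The Stream implementation does not recognize even an intermediate operation as simple as map(x -> x), so it has to assume that the elements might no longer be sorted with respect to the mapping function's result. That is why the second run falls back to hashing and calls hashCode().

There is no guarantee that this optimization happens. However, it is reasonable to assume that the developers of the Stream implementation will not remove it and will rather try to add more optimizations, so rolling your own implementation prevents your code from benefiting from future optimizations.

Further, what you have created is a "stateful predicate", which is strongly discouraged and will, of course, break when used with a parallel stream.

If you don't trust the Stream API to perform this operation efficiently enough, you might be better off implementing this particular operation without the Stream API.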
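To illustrate the last point, here is a minimal sketch of a "distinct and sort" done without the Stream API. The class name DistinctSortedWithoutStreams and the helper distinctSorted are hypothetical names chosen for this example, and the sketch assumes that compareTo is consistent with equals, so that equal elements end up adjacent after sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

class DistinctSortedWithoutStreams {

    // Hypothetical helper: returns the distinct elements of input in natural order.
    // Assumes compareTo is consistent with equals.
    static <T extends Comparable<? super T>> List<T> distinctSorted(List<T> input) {
        List<T> sorted = new ArrayList<>(input); // copy, so the caller's list stays untouched
        Collections.sort(sorted);
        List<T> result = new ArrayList<>();
        for (T value : sorted) {
            // After sorting, equal elements are adjacent, so it is enough
            // to compare with the last element that was kept.
            if (result.isEmpty() || !Objects.equals(result.get(result.size() - 1), value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 3, 5, 1, 3, 2);
        System.out.println(distinctSorted(input)); // prints [1, 2, 3, 5]
    }
}
```

The "previous value" state lives in a plain sequential loop here, so there is no stateful predicate and no question about how a parallel stream would treat it.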

【Comments】:

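Actually, I need the stateful predicate to filter results elsewhere in my code. It is also encouraged here: stackoverflow.com/questions/27870136/…
It is not really encouraged; there simply is no cleaner alternative for that kind of task, though perhaps that answer should emphasize the drawbacks more. Note that it uses a ConcurrentHashMap to ensure that it doesn't break completely when used concurrently. It still won't maintain the encounter order when used with a parallel stream. Since that specific task has an add step as its terminal operation, this may not matter there, but it can break badly in other cases.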
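For readers who have not seen this kind of predicate: the linked answer is not reproduced here, so the following is only a sketch of a commonly seen form of a ConcurrentHashMap-backed stateful predicate, written under that assumption. The names DistinctByKeySketch and distinctByKey are illustrative and are not taken from this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class DistinctByKeySketch {

    // Stateful predicate: remembers every key seen so far.
    // The concurrent set keeps it thread-safe, but in a parallel stream
    // it is not defined WHICH element per key passes the filter.
    // Null keys are not supported by the concurrent set.
    static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(keyExtractor.apply(t));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana", "blueberry", "cherry");
        List<String> firstPerLetter = words.stream() // sequential: the first word per letter wins
                .filter(distinctByKey(w -> w.charAt(0)))
                .collect(Collectors.toList());
        System.out.println(firstPerLetter); // prints [apple, banana, cherry]
    }
}
```

On a sequential stream the first element per key is kept. On a parallel stream the result still holds one element per key, but which one survives depends on thread timing, which is the lost encounter order described in the comment above.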

【Solution 2】:

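Disclaimer: I know that performance testing is hard, especially on the JVM, which needs a warm-up and a controlled environment with no other processes running.

When I test it, I get the results below, so it seems that your implementation benefits from parallel execution. (Run on an i7 with 4 cores + hyper-threading.)

So ".distinct().sorted()" seems to be slower, as Holger predicted/explained. Each round prints three timings in milliseconds, in this order: sorted() with the custom filter, sorted().distinct(), and distinct().sorted().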

Round 1 (Warm up?)
3938
2449
5747
Round 2
2834
2620
3984
Round 3 Parallel
831
4343
6346
Round 4 Parallel
825
3309
6339

Code used:

package test.test;

import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SortDistinctTest {

    public static void main(String[] args) {
        IntStream range = IntStream.range(0, 6_000_000);
        List<Integer> collect = range.boxed().collect(Collectors.toList());
        Collections.shuffle(collect);

        long start = System.currentTimeMillis();

        System.out.println("Round 1 (Warm up?)");
        collect.stream().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
        long fst = System.currentTimeMillis();
        System.out.println(fst - start);

        collect.stream().sorted().distinct().collect(Collectors.counting());
        long snd = System.currentTimeMillis();
        System.out.println(snd - fst);

        collect.stream().distinct().sorted().collect(Collectors.counting());
        long end = System.currentTimeMillis();
        System.out.println(end - snd);

        System.out.println("Round 2");
        collect.stream().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
        fst = System.currentTimeMillis();
        System.out.println(fst - end);

        collect.stream().sorted().distinct().collect(Collectors.counting());
        snd = System.currentTimeMillis();
        System.out.println(snd - fst);

        collect.stream().distinct().sorted().collect(Collectors.counting());
        end = System.currentTimeMillis();
        System.out.println(end - snd);

        System.out.println("Round 3 Parallel");
        collect.stream().parallel().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
        fst = System.currentTimeMillis();
        System.out.println(fst - end);

        collect.stream().parallel().sorted().distinct().collect(Collectors.counting());
        snd = System.currentTimeMillis();
        System.out.println(snd - fst);

        collect.stream().parallel().distinct().sorted().collect(Collectors.counting());
        end = System.currentTimeMillis();
        System.out.println(end - snd);

        System.out.println("Round 4 Parallel");
        collect.stream().parallel().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
        fst = System.currentTimeMillis();
        System.out.println(fst - end);

        collect.stream().parallel().sorted().distinct().collect(Collectors.counting());
        snd = System.currentTimeMillis();
        System.out.println(snd - fst);

        collect.stream().parallel().distinct().sorted().collect(Collectors.counting());
        end = System.currentTimeMillis();
        System.out.println(end - snd);

    }

    public static Predicate<Object> noAdjacentDuplicatesFilter() {
        final Object[] previousValue = { new Object() };

        return value -> {
            final boolean takeValue = !Objects.equals(previousValue[0], value);
            previousValue[0] = value;
            return takeValue;
        };

    }

}
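The test above times each pipeline once per round. As a slightly more careful variant, the following sketch runs each pipeline a few times untimed before measuring and reports the best of several timed runs. The class SortDistinctTimingSketch, the helper bestOfMillis, and the warm-up and run counts are assumptions made for this example, not part of the original test. The pipeline results are discarded, so the JIT may still distort the numbers; a dedicated harness such as JMH would be more reliable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SortDistinctTimingSketch {

    // Hypothetical helper: runs the task untimed 'warmups' times,
    // then returns the fastest of 'runs' timed executions in milliseconds.
    static long bestOfMillis(Supplier<Long> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000;
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 1_000_000).boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        Collections.shuffle(data);

        System.out.println("sorted().distinct(): " + bestOfMillis(
                () -> data.stream().sorted().distinct().collect(Collectors.counting()), 3, 5) + " ms");
        System.out.println("distinct().sorted(): " + bestOfMillis(
                () -> data.stream().distinct().sorted().collect(Collectors.counting()), 3, 5) + " ms");
    }
}
```

Like the original test, this input contains no duplicates at all, so it says little about data with many repeated values.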

【Comments】:

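The only thing I predicted is that noAdjacentDuplicatesFilter() will produce incorrect results with parallel streams. Performance is irrelevant then.
Well, if you say so. But I guess producing a wrong result fast is better than producing a wrong result slowly ;) Fail fast.
"Fail fast" means failing quickly in a recognizable way, e.g. by throwing an exception. That is the exact opposite of producing an incorrect result. It also implies a response to an erroneous request, not a suggestion to produce a failure where a correct result would be possible.
Hence the winking emoticon.
Not sure; maybe it is because I have met people who really did prefer speed over correctness...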
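To make the correctness point from this discussion concrete, the sketch below checks that the built-in sorted().distinct() pipeline yields the same list in parallel as it does sequentially. The class ParallelDistinctSortedCheck, the helper distinctSorted, and the random test data are assumptions made for illustration. A stateful predicate such as noAdjacentDuplicatesFilter() carries no such guarantee in parallel, and its failures would be timing-dependent, so it is deliberately not exercised here.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelDistinctSortedCheck {

    // Hypothetical helper: the built-in pipeline, run sequentially or in parallel.
    static List<Integer> distinctSorted(List<Integer> input, boolean parallel) {
        Stream<Integer> stream = parallel ? input.parallelStream() : input.stream();
        return stream.sorted().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Many duplicates: 100,000 values drawn from the range [0, 1000)
        List<Integer> data = new Random(42).ints(100_000, 0, 1_000)
                .boxed()
                .collect(Collectors.toList());

        List<Integer> sequential = distinctSorted(data, false);
        List<Integer> parallel = distinctSorted(data, true);

        // The built-in operations keep their semantics in parallel mode.
        System.out.println(sequential.equals(parallel)); // prints true
    }
}
```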