Why is this recursive function 3x faster than the iterative one? Answers

【Question title】: Why is this recursive function 3x faster than iterative one?
【Posted】: 2020-01-09 05:00:36
【Question description】:

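I have a simple recursive function that constructs a binary tree of a given depth.

I expected an iterative version with a DFS stack to achieve similar performance, but it turned out to be a surprising 3x slower!

More precisely, on my machine the recursive version takes roughly 330_000 ns at depth 15, whereas the iterative version with a stack takes roughly 950_000 ns.

Can the surprising performance be attributed to superior cache locality (which is apparently better for the recursive function)?

The code I use for the performance benchmark: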

class Main {
    public static void main(String[] args) {
        long startTime = System.nanoTime();
        long runs;
        Tree t = null;
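        // Time-boxed loop: keep building depth-15 trees for about 3 seconds.
        // This times the iterative createTree3; swap in createTree(15) for the recursive version.
        for(runs=0; (System.nanoTime() - startTime)< 3_000_000_000L ; runs++) {
            t = createTree3(15);
        }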
        System.out.println((System.nanoTime() - startTime) / runs + " ns/call");
    }

    static Tree createTree(int depth) {
        Tree t = new Tree();
        createTreeHlp(t, depth);
        return t;
    }

    static void createTreeHlp(Tree tree, int depth) {
        if (depth == 0)
            tree.init(0, null, null);
        else {
            tree.init(depth, new Tree(), new Tree());
            createTreeHlp(tree.leftChild, depth -1);
            createTreeHlp(tree.rghtChild, depth -1);
        }
    }


    static Tree createTree3(int depth_) {
        TreeStack stack = new TreeStack();
        Tree result = new Tree();
        stack.put(result, depth_);
        while (!stack.isEmpty()) {
            int depth = stack.depth[stack.stack][stack.index];
            Tree tree = stack.tree[stack.stack][stack.index];
            stack.dec();
            if (depth == 0)
                tree.init(0, null, null);
            else {
                tree.init(depth, new Tree(), new Tree());
                stack.put(tree.leftChild, depth -1);
                stack.put(tree.rghtChild, depth -1);
            }
        }
        return result;
    }
}

class Tree {
    int payload;
    Tree leftChild;
    Tree rghtChild;

    public Tree init(int payload, Tree leftChild, Tree rghtChild) {
        this.leftChild = leftChild;
        this.rghtChild = rghtChild;
        this.payload = payload;
        return this;
    }

    @Override
    public String toString() {
        return "Tree(" +payload+", "+ leftChild + ", " + rghtChild + ")";
    }
}
class TreeStack {

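    Tree[][] tree;   // pending nodes, stored in segments
    int[][] depth;   // remaining depth for the node at the same position

    int stack =  1;  // index of the current segment
    int index = -1;  // top-of-stack position within the current segment

    TreeStack() {
        this.tree = new Tree[100][];
        this.depth = new int[100][];

        alloc(100_000);  // segment 1: the first real segment
        --stack;
        alloc(0);        // segment 0: zero-length sentinel, so index == -1 means empty
    }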

    boolean isEmpty() {
        return index == -1;
    }

    void alloc(int size) {
        tree[stack] = new Tree[size];
        depth[stack] = new int[size];
    }

    void inc() {
        if (tree[stack].length == ++index) {
            if (tree[++stack] == null) alloc(2 * index);
            index = 0;
        }
    }
    void dec() {
        if (--index == -1)
            index = tree[--stack].length - 1;
    }

    void put(Tree tree, int depth) {
        inc();
        this.tree[stack][index] = tree;
        this.depth[stack][index] = depth;
    }
}

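【Comments】:

Maybe you would get some insight using a profiler?
I tried profiling the code with VisualVM, but it only told me that 70% of the self time is spent in the bodies of createTree1 and createTree3. The rest is Tree init. Stack operations account for less than 10%.
Is there a reason you wrote your own stack instead of using ArrayDeque? I still see the iterative version being slow with ArrayDeque, but I am just curious.
This is not just recursion vs. iteration. The iterative solution is too complicated, and I think it spends a lot of time handling the stack. It should be possible to write a better iterative solution.
@Donat I would be very interested in a less complicated iterative version, as long as it does not change the traversal order (DFS).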
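
For reference, here is a minimal sketch of what a simpler stack-based DFS along the lines of the ArrayDeque suggestion might look like. It is not code from the thread: the class name DequeTreeSketch, the Node class, and the trick of reading the remaining depth from the node's own payload (instead of keeping a second stack of depths) are all assumptions made for illustration. Pushing the right child before the left one means the left subtree is expanded first, as in the recursive version. No timing claim is made for it.

```java
import java.util.ArrayDeque;

final class DequeTreeSketch {
    // Minimal stand-in for the question's Tree class (hypothetical name).
    static final class Node {
        int payload;
        Node left;
        Node right;

        Node(int payload) {
            this.payload = payload;
        }
    }

    // Recursive reference that produces the same shape as the question's createTree.
    static Node buildRecursive(int depth) {
        Node n = new Node(depth);
        if (depth > 0) {
            n.left = buildRecursive(depth - 1);
            n.right = buildRecursive(depth - 1);
        }
        return n;
    }

    // Iterative DFS with an explicit stack. The remaining depth is read from
    // the node's payload, so only one stack is needed.
    static Node buildIterative(int depth) {
        Node root = new Node(depth);
        ArrayDeque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.payload > 0) {
                n.left = new Node(n.payload - 1);
                n.right = new Node(n.payload - 1);
                // Push right first so that the left subtree is expanded first.
                stack.push(n.right);
                stack.push(n.left);
            }
        }
        return root;
    }

    // Counts all nodes reachable from n.
    static int countNodes(Node n) {
        return n == null ? 0 : 1 + countNodes(n.left) + countNodes(n.right);
    }

    public static void main(String[] args) {
        // A full tree of depth 15 has 2^16 - 1 = 65535 nodes.
        System.out.println(countNodes(buildIterative(15)));
        System.out.println(countNodes(buildRecursive(15)));
    }
}
```

Whether this closes the gap to the recursive version would have to be measured; the commenter above reports that an ArrayDeque version was still slower for them.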

Tags: java recursion

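【Solution 1】:

Short answer: because you coded it that way.

Long answer: you create a stack, put things into it, take things out of it, and it is all quite complicated. Let's keep it simple for this case. You want a tree of a given depth with all children present, where the values are the depth, and you want the deepest level first. Here is a simple way: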

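static Tree createTree3(int depth_) {
    // One slot per leaf; the array is reused for every higher level.
    Tree[] arr = new Tree[1 << depth_];

    // Create the 2^depth_ leaves first.
    int count = 1 << depth_;
    for (int i=0; i<count; i++)
        arr[i] = new Tree().init(0, null, null);

    // Build each higher level by pairing up the nodes of the level below.
    int d = 1;
    count >>= 1;
    while (count > 0)
    {
        for (int i=0; i<count; i++)
        {
            Tree t = new Tree().init(d, arr[i * 2], arr[i * 2 + 1]);
            arr[i] = t;
        }
        count >>= 1;
        d++;
    }

    // After the last pass the root sits in slot 0.
    return arr[0];
}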

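It first creates the nodes of the lowest level, of which there are 2^depth. Then it creates the next level of nodes and attaches the children. Then the next one, and the next. No stack, no recursion, just simple loops.

I benchmarked it by building a depth-14 tree 20000 times, with no time checks inside the loop or anything else, just creating the trees. Results on my i7 laptop:

Your recursion takes ~187 µs/tree
My iteration takes ~177 µs/tree

If I run depth 15, it is 311 vs. 340.

The timings fluctuate because this measures system (wall-clock) time rather than CPU time, and they also depend on whether the JIT handles things differently, and so on.

But in short: in this case, even with this simple change, iteration can easily be made as fast as recursion, and I am sure there are smarter ways still.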
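
On the wall-clock vs. CPU time point, a rough sketch of how one might warm up the JIT and then measure per-thread CPU time with the standard ThreadMXBean API follows. The class name CpuTimeBenchSketch, the Node class, and the warm-up and run counts are assumptions for illustration, and this is not how the numbers above were obtained. A dedicated harness such as JMH is likely a more reliable choice for real measurements.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class CpuTimeBenchSketch {
    // Minimal stand-in for the question's Tree class (hypothetical name).
    static final class Node {
        final int payload;
        final Node left;
        final Node right;

        Node(int payload, Node left, Node right) {
            this.payload = payload;
            this.left = left;
            this.right = right;
        }
    }

    // Workload under test: a plain recursive full-tree build.
    static Node build(int depth) {
        if (depth == 0) {
            return new Node(0, null, null);
        }
        return new Node(depth, build(depth - 1), build(depth - 1));
    }

    // Average nanoseconds per build. Uses per-thread CPU time when the JVM
    // supports it and has it enabled, otherwise falls back to wall-clock time.
    static long averageNanos(int depth, int warmupRuns, int measuredRuns) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean useCpuTime = bean.isCurrentThreadCpuTimeSupported()
                && bean.isThreadCpuTimeEnabled();

        Node sink = null;
        // Warm-up: give the JIT a chance to compile the hot path before timing.
        for (int i = 0; i < warmupRuns; i++) {
            sink = build(depth);
        }

        long start = useCpuTime ? bean.getCurrentThreadCpuTime() : System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink = build(depth);
        }
        long end = useCpuTime ? bean.getCurrentThreadCpuTime() : System.nanoTime();

        // Touch the result so the builds are not trivially dead code.
        if (sink == null) {
            throw new IllegalStateException("no tree was built");
        }
        return (end - start) / measuredRuns;
    }

    public static void main(String[] args) {
        System.out.println(averageNanos(14, 2_000, 20_000) + " ns/tree");
    }
}
```

CPU-time clocks can have coarse resolution on some platforms, so short measured runs may report 0; increasing measuredRuns helps.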

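【Comments】:

The problem with this approach is that it does not traverse the tree in the same order as the recursion, so you are comparing two different algorithms. For that reason alone I cannot accept this answer.
@FordO. What order should it be, then? You wanted a DFS, and this is how it goes. You can also do it the other way if you like; it is easy to modify.
You are not allocating the tree nodes the way the recursive version does.
Imagine the tree is unbalanced.
@FordO. The allocation order does not matter here. You asked about this specific case, so I answered it. The end result is what matters; otherwise you end up with identical code that works identically, and there is no question left. Or you have to be more specific. In the end the answer is still: you wrote a slow implementation.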
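
To make the "end result" argument concrete, here is a small sketch that checks whether a bottom-up build and a recursive build produce structurally identical trees. The names ShapeCheckSketch, Node, buildBottomUp and sameShape are hypothetical, and buildBottomUp is an adaptation of the answer's loop rather than a copy. Note that it compares only the final structure; it says nothing about allocation order or memory layout, which is the objection raised in the comments.

```java
final class ShapeCheckSketch {
    // Minimal stand-in for the question's Tree class (hypothetical name).
    static final class Node {
        final int payload;
        final Node left;
        final Node right;

        Node(int payload, Node left, Node right) {
            this.payload = payload;
            this.left = left;
            this.right = right;
        }
    }

    // Top-down recursive build.
    static Node buildRecursive(int depth) {
        if (depth == 0) {
            return new Node(0, null, null);
        }
        return new Node(depth, buildRecursive(depth - 1), buildRecursive(depth - 1));
    }

    // Bottom-up build: create all leaves, then pair nodes up level by level.
    static Node buildBottomUp(int depth) {
        int count = 1 << depth;
        Node[] level = new Node[count];
        for (int i = 0; i < count; i++) {
            level[i] = new Node(0, null, null);
        }
        for (int d = 1; d <= depth; d++) {
            count >>= 1;
            for (int i = 0; i < count; i++) {
                // Slots 2i and 2i+1 are read before slot i is overwritten.
                level[i] = new Node(d, level[2 * i], level[2 * i + 1]);
            }
        }
        return level[0];
    }

    // True if both trees have the same shape and the same payload at every node.
    static boolean sameShape(Node a, Node b) {
        if (a == null || b == null) {
            return a == b;
        }
        return a.payload == b.payload
                && sameShape(a.left, b.left)
                && sameShape(a.right, b.right);
    }

    public static void main(String[] args) {
        System.out.println(sameShape(buildRecursive(10), buildBottomUp(10))); // prints true
    }
}
```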