Tail recursion vs non tail recursion. Is the former slower?

【Question title】: Tail recursion vs non tail recursion. Is the former slower?
【Posted】: 2019-10-30 04:52:03
【Question】:

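I'm learning the basics of functional programming and Erlang, and I've implemented three versions of the factorial function: recursion with guards, recursion with pattern matching, and tail recursion.

I'm trying to compare the performance of each factorial implementation (Erlang/OTP 22 [erts-10.4.1]):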

%% Simple factorial code:
fac(N) when N == 0 -> 1;
fac(N) when N > 0 -> N * fac(N - 1).

%% Using pattern matching:
fac_pattern_matching(0) -> 1;
fac_pattern_matching(N) when N > 0 -> N * fac_pattern_matching(N - 1).

%% Using tail recursion (and pattern matching):
tail_fac(N) -> tail_fac(N, 1).

tail_fac(0, Acc) -> Acc;
tail_fac(N, Acc) when N > 0 -> tail_fac(N - 1, N * Acc).

Timer helper:

-define(PRECISION, microsecond).

execution_time(M, F, A, D) ->
  StartTime = erlang:system_time(?PRECISION),
  Result = apply(M, F, A),
  EndTime = erlang:system_time(?PRECISION),
  io:format("Execution took ~p ~ps~n", [EndTime - StartTime, ?PRECISION]),
  if
    D =:= true -> io:format("Result is ~p~n", [Result]);
    true -> ok
  end.

Results:

Recursive version:

3> mytimer:execution_time(factorial, fac, [1000000], false).
Execution took 1253949667 microseconds
ok

Recursive version with pattern matching:

4> mytimer:execution_time(factorial, fac_pattern_matching, [1000000], false).
Execution took 1288239853 microseconds
ok

Tail-recursive version:

5> mytimer:execution_time(factorial, tail_fac, [1000000], false).
Execution took 1405612434 microseconds
ok

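I expected the tail-recursive version to perform better than the other two, but to my surprise it performs worse. These results are the exact opposite of what I expected.

Why?

【Comments】: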

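Writing benchmarks is hard. Are you sure you are measuring what you think you are measuring? Did you account for statistical effects? Did you account for dynamic adaptive optimizations? Did you account for the environment? Here are a couple of examples of the non-obvious things you need to consider in a benchmark: groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/…, *.com/a/513259/2988. These mostly discuss HotSpot, but most of the issues apply to any modern high-performance execution engine with dynamic adaptive optimizations.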
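One way to address the statistical side of that comment is to repeat each measurement and look at more than a single number. The following is only a minimal sketch: the module name `repeat_bench` and the function `run/4` are made up for illustration, and it relies on nothing beyond `timer:tc/3` and the `lists` module.

```erlang
-module(repeat_bench).
-export([run/4]).

%% Hypothetical helper: call M:F(A) Times times and return
%% {MinMicros, MedianMicros}. For an even number of samples the
%% lower of the two middle values is reported as the median.
run(M, F, A, Times) when is_integer(Times), Times > 0 ->
    Samples = lists:sort([element(1, timer:tc(M, F, A))
                          || _ <- lists:seq(1, Times)]),
    {hd(Samples), lists:nth((Times + 1) div 2, Samples)}.
```

For example, `repeat_bench:run(factorial, tail_fac, [10000], 11)` would time the tail-recursive factorial eleven times. All runs happen in the calling process, so effects such as an already grown heap carry over from one run to the next.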

Tags: erlang tail-recursion

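【Solution 1】:

The problem is the function you chose. Factorial grows very fast. Erlang implements big integer arithmetic, so the result does not overflow, which means you are effectively measuring how good the underlying big integer implementation is. 1000000! is a huge number: about 8.26×10^5565708, which is roughly 5.6 MB long when written out as a decimal number.

Your fac/1 and tail_fac/1 differ in how soon they reach numbers large enough for the big integer implementation to kick in, and in how fast the number grows. Your fac/1 effectively computes 1*2*3*4*...*N, whereas your tail_fac/1 computes N*(N-1)*(N-2)*(N-3)*...*1, so its accumulator starts with the largest factors and is bigger at every step. Do you see the problem there? You can write the tail-call implementation differently: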

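%% Tail-recursive factorial that multiplies in ascending order (1*2*...*N),
%% so the accumulator grows as slowly as possible.
tail_fac2(N) when is_integer(N), N >= 0 ->
    tail_fac2(N, 0, 1).

%% X counts up from 0 to N; Acc holds X! at every step.
tail_fac2(X, X, Acc) -> Acc;
tail_fac2(N, X, Acc) ->
    Y = X + 1,
    tail_fac2(N, Y, Y*Acc).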

It will work much better. I'm not as patient as you are, so I will measure smaller numbers, but the new fact:tail_fac2/1 should outperform fact:fac/1 every time:

1> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7743768
2> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7629604
3> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7651739
4> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7229662
5> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7104056
6> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6491195
7> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6506565
8> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6519624

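As you can see, for N = 100000 fact:tail_fac2/1 takes about 6.5 s, fact:tail_fac/1 about 7.2 s, and fact:fac/1 about 7.6 s. Even the faster accumulator growth does not cancel out the benefit of the tail call, so tail_fac/1 is still faster than the body-recursive version, and the slower accumulator growth in fact:tail_fac2/1 clearly shows its effect.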
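To make the accumulator growth argument concrete, here is a rough sketch that compares how large the accumulator is after K multiplications in each order. The module `acc_growth` and its functions are hypothetical names introduced only for this illustration. `binary:encode_unsigned/1` is used as a convenient way to get the byte length of an integer, which is an approximation of the work the big integer code has to do, not a measurement of it.

```erlang
-module(acc_growth).
-export([asc_bytes/1, desc_bytes/2]).

%% Byte length of 1*2*...*K, i.e. the accumulator of an
%% ascending-order factorial (like tail_fac2) after K steps.
asc_bytes(K) when is_integer(K), K >= 0 ->
    int_bytes(product(1, K, 1)).

%% Byte length of N*(N-1)*...*(N-K+1), i.e. the accumulator of a
%% descending-order factorial (like tail_fac) after K steps.
desc_bytes(N, K) when is_integer(N), is_integer(K), K >= 0, K =< N ->
    int_bytes(product(N - K + 1, N, 1)).

%% Product of all integers in From..To (1 for an empty range).
product(From, To, Acc) when From > To -> Acc;
product(From, To, Acc) -> product(From + 1, To, From * Acc).

%% Number of bytes needed to represent a non-negative integer.
int_bytes(X) -> byte_size(binary:encode_unsigned(X)).
```

Comparing `acc_growth:asc_bytes(500)` with `acc_growth:desc_bytes(1000, 500)` shows that halfway through computing 1000! the descending-order accumulator is already the larger of the two, so every remaining multiplication has a bigger operand to work on.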

If you pick a different function to test tail-call optimization, you can see its effect more clearly. For example, sum:

sum(0) -> 0;
sum(N) when N > 0 -> N + sum(N-1).

tail_sum(N) when is_integer(N), N >= 0 ->
    tail_sum(N, 0).

tail_sum(0, Acc) -> Acc;
tail_sum(N, Acc) -> tail_sum(N-1, N+Acc).

And the timings are:

1> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
970749
2> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
126288
3> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
113115
4> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
104371
5> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
125857
6> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92282
7> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92634
8> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
68047
9> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
87748
10> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
94233

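As you can see, here we can easily use N = 10000000 and it still runs very fast. Even so, the body-recursive function is noticeably slower: roughly 110 ms versus 85 ms. You will also notice that the first run of fact:sum/1 took about 9 times longer than the remaining runs. This is because the body-recursive function consumes stack space, which has to be grown on that first run. You won't see this effect with the tail-recursive counterpart. (Try it.) You can see the difference if you run each measurement in a separate process: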

1> F = fun(G, N) -> spawn(fun() -> {T, _} = timer:tc(fun()-> fact:G(N) end), io:format("~p took ~bus and ~p heap~n", [G, T, element(2, erlang:process_info(self(), heap_size))]) end) end.
#Fun<erl_eval.13.91303403>
2> F(tail_sum, 10000000).
<0.88.0>
tail_sum took 70065us and 987 heap
3> F(tail_sum, 10000000).
<0.90.0>
tail_sum took 65346us and 987 heap
4> F(tail_sum, 10000000).
<0.92.0>
tail_sum took 65628us and 987 heap
5> F(tail_sum, 10000000).
<0.94.0>
tail_sum took 69384us and 987 heap
6> F(tail_sum, 10000000).
<0.96.0>
tail_sum took 68606us and 987 heap
7> F(sum, 10000000).
<0.98.0>
sum took 954783us and 22177879 heap
8> F(sum, 10000000).
<0.100.0>
sum took 931335us and 22177879 heap
9> F(sum, 10000000).
<0.102.0>
sum took 934536us and 22177879 heap
10> F(sum, 10000000).
<0.104.0>
sum took 945380us and 22177879 heap
11> F(sum, 10000000).
<0.106.0>
sum took 921855us and 22177879 heap
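The shell fun above only prints its result. If you want the numbers back as a value, something along these lines should work. It is a sketch rather than a finished benchmark tool, and the module name `bench_sketch` and the function `measure/3` are invented for this example. As with the shell version, the heap size comes from `erlang:process_info/2`, which reports it in words.

```erlang
-module(bench_sketch).
-export([measure/3]).

%% Hypothetical helper: run M:F(A) in a fresh process and return
%% {Micros, HeapWords}, or {error, Reason} if the worker crashes.
measure(M, F, A) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() ->
        {Micros, _Result} = timer:tc(M, F, A),
        {heap_size, Heap} = erlang:process_info(self(), heap_size),
        Parent ! {Ref, Micros, Heap}
    end),
    receive
        {Ref, T, H} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {T, H};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    end.
```

Calling `bench_sketch:measure(fact, sum, [10000000])` and `bench_sketch:measure(fact, tail_sum, [10000000])` would then give one `{Micros, HeapWords}` pair per run, each taken in a process that starts with a fresh heap.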

【Comments】:

【Solution 2】:

The Erlang documentation states:

It is generally not possible to predict whether the tail-recursive 
or the body-recursive version will be faster. Therefore, use the version that
makes your code cleaner (hint: it is usually the body-recursive version).

http://erlang.org/doc/efficiency_guide/myths.html
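For context, the comments on this answer point out that the quoted passage is about functions that build a list. The two styles under discussion look roughly like the sketch below. The module `map_styles` and both function names are illustrative only and are not taken from the Efficiency Guide.

```erlang
-module(map_styles).
-export([body_map/2, tail_map/2]).

%% Body-recursive: the result list is built in the correct order
%% as the recursive calls return.
body_map(_F, []) -> [];
body_map(F, [H | T]) -> [F(H) | body_map(F, T)].

%% Tail-recursive: the result is accumulated in reverse order
%% and reversed once at the end with lists:reverse/1.
tail_map(F, L) -> tail_map(F, L, []).

tail_map(_F, [], Acc) -> lists:reverse(Acc);
tail_map(F, [H | T], Acc) -> tail_map(F, T, [F(H) | Acc]).
```

Both functions return the same list. Neither the factorial nor the sum in the question builds a list, which is the objection raised in the comments.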

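【Comments】:

You missed an important part: according to the myth, a tail-recursive function that builds a list in reverse and then calls lists:reverse/1 is faster than a body-recursive function that builds the list in the correct order, the reason being that body-recursive functions use more memory than tail-recursive functions.
I just quoted the documentation. Also note "It is generally not possible to predict..", and that you quoted **according to the myth** and dropped the part that comes right before those words: "That was true to some extent before R12B. It was even more true before R7B. Today, not so much. A body-recursive function generally uses the same amount of memory as a tail-recursive function."
No, you don't understand. The whole of section 2.3 is about a function that builds a list in reverse and then calls lists:reverse/1. So if you don't build a list or don't call lists:reverse/1, your body-recursive function generally does not use the same amount of memory as a tail-recursive function, and you can predict with 100% certainty that the tail-recursive function will be more efficient. Your answer is therefore irrelevant to the OP, because the OP neither builds a list nor calls lists:reverse/1. Try it! Disprove it!
What does a NIF written in C have to do with the example above?
Nothing, and that's why your answer is irrelevant. I'm telling you for the fourth time! The myth rebuttal you quoted is about a tail-recursive function that builds a list in reverse and then calls lists:reverse/1; it has nothing to do with the example above. That's why your answer is irrelevant, wrong, and misleading!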