Which is faster for reverse iteration, a for loop or a while loop? Answers

【Question title】: Which is faster for reverse iteration, for or while loops?
【Posted】: 2016-06-03 06:18:33
【Question】:

I am trying to implement the standard memmove function in Rust, and I was wondering which method is faster for downward iteration (where src < dest):

for i in (0..n).rev() {
    //Do copying
}

or

let mut i = n;
while i != 0 {
    i -= 1;
    // Do copying
}

Will the rev() in the for loop version slow it down significantly?
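
For concreteness, the "Do copying" placeholder in both loops could be filled in roughly as follows. This is only a minimal sketch, assuming a byte-wise copy through raw pointers; the function names copy_backward_for and copy_backward_while are hypothetical and do not come from the question.

```rust
/// Copies `n` bytes from `src` to `dest`, walking from the last byte to the
/// first, so overlapping regions are handled correctly when `src < dest`.
/// (Hypothetical helper, for illustration only.)
///
/// # Safety
/// `src` must be readable and `dest` writable for `n` bytes.
pub unsafe fn copy_backward_for(dest: *mut u8, src: *const u8, n: usize) {
    for i in (0..n).rev() {
        // Copy byte i; the higher-indexed bytes have already been moved.
        unsafe { *dest.add(i) = *src.add(i) };
    }
}

/// Same copy, written with the explicit countdown from the question.
/// (Hypothetical helper, for illustration only.)
///
/// # Safety
/// `src` must be readable and `dest` writable for `n` bytes.
pub unsafe fn copy_backward_while(dest: *mut u8, src: *const u8, n: usize) {
    let mut i = n;
    while i != 0 {
        // Decrement first so that i is always a valid index in 0..n.
        i -= 1;
        unsafe { *dest.add(i) = *src.add(i) };
    }
}
```

Both functions visit the indices n-1 down to 0, so they differ only in how the countdown is spelled.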

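【Comments】:

I found that, for some reason, the for loop version of this code fails strangely, with page faults and general protection faults (I'm very new to low-level development). Because of that, I will use the while loop version for now.
If the while loop does not fail, there is no reason for the for loop to fail; the problem probably lies elsewhere.

Tags: for-loop while-loop rust memmove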

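【Answer 1】:

TL;DR: Use the for loop.

Both should be equally fast. We can check quite simply how well the compiler peels away the layers of abstraction involved in the for loop: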

#[inline(never)]
fn blackhole() {}

#[inline(never)]
fn with_for(n: usize) {
    for i in (0..n).rev() { blackhole(); }
}

#[inline(never)]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        blackhole();
        i -= 1;
    }
}

This generates the following LLVM IR:

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

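Even if you are not well versed in LLVM, it is obvious that both functions compile down to the same IR (and therefore to the same assembly).

Since their performance is the same, prefer the more explicit for loop and reserve the while loop for cases where the iteration is irregular.

EDIT: to address starblue's concern that the example above was unfit.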

extern crate libc; // provides c_int; the libc crate must be listed as a dependency

#[link(name = "snappy")]
extern {
    fn blackhole(i: libc::c_int) -> libc::c_int;
}

#[inline(never)]
fn with_for(n: i32) {
    for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}

#[inline(never)]
fn with_while(n: i32) {
    let mut i = n;
    while i > 0 {
        unsafe { blackhole(i as libc::c_int); }
        i -= 1;
    }
}

This compiles down to:

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %match_case.preheader, label %clean_ast_95_

match_case.preheader:                             ; preds = %entry-block
  br label %match_case

match_case:                                       ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

clean_ast_95_.loopexit:                           ; preds = %match_case
  br label %clean_ast_95_

clean_ast_95_:                                    ; preds = %clean_ast_95_.loopexit, %entry-block
  ret void
}

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %while_body.preheader, label %while_exit

while_body.preheader:                             ; preds = %entry-block
  br label %while_body

while_exit.loopexit:                              ; preds = %while_body
  br label %while_exit

while_exit:                                       ; preds = %while_exit.loopexit, %entry-block
  ret void

while_body:                                       ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
}

The core loops are:

; -- for loop
match_case:                                       ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

; -- while loop
while_body:                                       ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit

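The only differences are:

for decrements before calling blackhole, while decrements after
for compares against 0, while compares against 1

Otherwise, it is the same core loop.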
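
The first difference is visible at the source level too: as written, the two loops do not hand the same values to blackhole. The following is a small sketch, with hypothetical function names and a Vec standing in for the external blackhole call, that records what each loop sees.

```rust
/// Values seen by the body of `for i in (0..n).rev()`.
/// (Hypothetical helper; a Vec stands in for the blackhole call.)
pub fn values_for(n: i32) -> Vec<i32> {
    let mut seen = Vec::new();
    for i in (0..n).rev() {
        seen.push(i);
    }
    seen
}

/// Values seen by the while loop as written in the answer:
/// the body runs first, then `i` is decremented.
pub fn values_while(n: i32) -> Vec<i32> {
    let mut seen = Vec::new();
    let mut i = n;
    while i > 0 {
        seen.push(i);
        i -= 1;
    }
    seen
}

/// The while loop with the decrement moved before the body,
/// which makes it visit the same values as the for loop.
pub fn values_while_decrement_first(n: i32) -> Vec<i32> {
    let mut seen = Vec::new();
    let mut i = n;
    while i > 0 {
        i -= 1;
        seen.push(i);
    }
    seen
}
```

For n = 3 the for loop sees 2, 1, 0 while the answer's while loop sees 3, 2, 1, which matches the "decrement before" versus "decrement after" difference in the IR. When the index is used for copying, the decrement-first form is the one equivalent to the for loop.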

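【Comments】:

This is a bad example, because the compiler notices that both functions do nothing. Some non-trivial computation would be more interesting (perhaps computing and returning a sum).
@starblue: Actually, the compiler noticing that it does nothing is (IMHO) a perfect example: it is exactly what I wanted to show. The compiler can strip away the layers of abstraction involved in the range iterator, its reversal, and the iteration over its reversed form in the for loop. The risk with abstractions (such as the for loop) is always that the compiler cannot optimize them away; this example proves that this is not the case here.
@starblue: Here you go, with a better black hole that LLVM cannot inline. Of course the result is the same, because in order to optimize, LLVM first has to reduce the code to its bare components.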

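【Answer 2】:

In short: they are (nearly) equally fast, so use the for loop!

Longer version:

First: rev() only works on iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), and usually even in O(1) (constant time). Indeed, looking at the implementation of next_back() for Range, we can see that it runs in constant time.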
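
To make the role of next_back() concrete, here is a minimal sketch that walks a range from the back by hand, which is essentially what iterating over (0..n).rev() does. The function name reversed_by_hand is hypothetical.

```rust
/// Collects 0..n in reverse order by calling next_back() directly.
/// (Hypothetical helper, for illustration only.)
pub fn reversed_by_hand(n: usize) -> Vec<usize> {
    let mut range = 0..n;
    let mut out = Vec::new();
    // Each call takes one element off the back of the range,
    // without touching the elements in front of it.
    while let Some(i) = range.next_back() {
        out.push(i);
    }
    out
}
```

The result is the same sequence that (0..n).rev() yields, produced one constant-time step at a time.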

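Now we know that both versions have asymptotically identical runtime. When that is the case, you should usually stop thinking about it and use the more idiomatic solution (for, in this case). Thinking about optimization too early often reduces programming productivity, because performance matters in only a small fraction of the code you write.

But since you are implementing memmove, performance might actually matter to you. So let's look at the generated ASM. I used this code: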

#![feature(start)]
#![feature(test)]

extern crate test;

#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
    for i in (0..n).rev() { 
        test::black_box(i); 
    }
}

#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        test::black_box(i);
        i -= 1;
    }
}

#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
    let random_enough_value = unsafe {
        **vargs as usize
    };

    with_for(random_enough_value);
    with_while(random_enough_value);
    0
}


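#[no_mangle] is there to improve the readability of the generated ASM. #[inline(never)], random_enough_value, and black_box are used to prevent LLVM from optimizing away things we do not want optimized away. The generated ASM (in release mode!), with some cleanup, looks like this: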

with_for:                       |   with_while:
    testq   %rdi, %rdi          |       testq   %rdi, %rdi
    je  .LBB0_3                 |       je  .LBB1_3
    decq    %rdi                |   
    leaq    -8(%rsp), %rax      |       leaq    -8(%rsp), %rax
.LBB0_2:                        |   .LBB1_2:
    movq    %rdi, -8(%rsp)      |       movq    %rdi, -8(%rsp)
    decq    %rdi                |       decq    %rdi
    cmpq    $-1, %rdi           |       
    jne .LBB0_2                 |       jne .LBB1_2
.LBB0_3:                        |   .LBB1_3:
    retq                        |       retq

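The only difference is that with_while has two fewer instructions, because it counts down to 0 rather than to -1 as with_for does.

Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high-level constructs down to nearly perfect ASM. Often, data layout and the resulting cache efficiency matter much more than a minimal instruction count anyway.

If you really do need to think about optimization, look at the ASM (or the LLVM IR). In this case the for loop is actually slightly slower (more instructions, and a comparison with -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably tiny.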
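
For readers who want to repeat the experiment without nightly features, a rough equivalent can be written with std::hint::black_box, which is available on stable Rust. This is a sketch rather than the code used above: the returned iteration count is an addition made only so the functions can be checked, #[no_mangle] is left out, and both changes may alter the generated code slightly.

```rust
use std::hint::black_box;

/// Reverse iteration with a for loop; black_box keeps the loop body
/// from being optimized away. Returns the number of iterations
/// (an addition for checking, not part of the original experiment).
#[inline(never)]
pub fn with_for(n: usize) -> usize {
    let mut iterations = 0;
    for i in (0..n).rev() {
        black_box(i);
        iterations += 1;
    }
    iterations
}

/// Reverse iteration with an explicit countdown, as in the answer.
#[inline(never)]
pub fn with_while(n: usize) -> usize {
    let mut iterations = 0;
    let mut i = n;
    while i > 0 {
        black_box(i);
        i -= 1;
        iterations += 1;
    }
    iterations
}
```

Calling both from a main function with a value the compiler cannot know in advance (for example, one derived from the command-line arguments) and building in release mode should give output comparable to the listing above, although the exact instructions depend on the compiler version.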

【Comments】:

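【Answer 3】:

For small N, it really should not matter.

Rust is lazy with iterators; 0..n does not cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, Rust's counter iterator is clever and does not need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.

In the general case, it depends on what access paradigm and access time your iterator has; as long as accessing the end takes constant time, it makes no difference.

As with all benchmarking questions, it depends. Test with your own values of N!

Premature optimization is also harmful, so if your N is small and your loop does not run very often... don't worry.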
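
One way to test this for yourself is a small timing harness. The following is a minimal sketch using std::time::Instant and std::hint::black_box; the names sum_for, sum_while, and time_it are hypothetical, and the summing workload is an assumption chosen only to give the loops something to do. A dedicated benchmarking tool will give more trustworthy numbers than this.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums the slice from back to front with a reversed range.
/// (Hypothetical workload, for illustration only.)
pub fn sum_for(data: &[u64]) -> u64 {
    let mut total = 0u64;
    for i in (0..data.len()).rev() {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Sums the slice from back to front with an explicit countdown.
pub fn sum_while(data: &[u64]) -> u64 {
    let mut total = 0u64;
    let mut i = data.len();
    while i != 0 {
        i -= 1;
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Runs `f` over `data` for `rounds` rounds and returns the last result
/// together with the total elapsed time. black_box discourages the
/// optimizer from hoisting or removing the repeated calls.
pub fn time_it(f: fn(&[u64]) -> u64, data: &[u64], rounds: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..rounds {
        last = black_box(f(black_box(data)));
    }
    (last, start.elapsed())
}
```

Build in release mode, call time_it with each function on data of the size you care about, and compare the durations over several runs; single measurements are noisy.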

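【Comments】:

Note that rev is only available if the iterator implements DoubleEndedIterator, and that trait is generally only implemented when accessing the "next last" element is O(1).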