Why is there such a large difference between for loops? (answers)

【Question title】: Why the large difference between for loops?
【Posted】: 2013-12-12 18:23:54
【Question】:

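I was curious about the difference between for(;;) and for(:), especially in terms of speed. So I ran a small test: I filled a vector with 10 million integers and summed them all in a for loop. I found that for(:) was 1.3 times slower.

What could make for(:) that much slower!?

Edit: It seems that for(:) uses the vector's iterators, unlike for(;;), which makes it take longer.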

/Yu"stdafx.h" /GS /analyze- /W3 /Zc:wchar_t /ZI /Gm /Od /sdl /Fd"Debug\vc120.pdb" /fp:precise /D"WIN32" /D" _DEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /RTC1 /Gd /Oy- /MDd /Fa"Debug\" /EHsc /nologo /Fo"Debug\" /Fp"Debug\forvsForLoop.pch"

#include "stdafx.h"
#include <cstdlib>   // srand, rand, system
#include <vector>
#include <iostream>
#include <chrono>

// Fill the vector with 10 million pseudo-random integers (fixed seed).
void init(std::vector<int> &array){
    srand(20);
    for (int x = 0; x < 10000000; x++)
        array.push_back(rand());
}

// Index-based loop, i.e. for(;;)
unsigned long testForLoop(std::vector<int> &array){
    unsigned long result = 0;
    for (int x = 0; x < array.size(); x++)
        result += array[x];
    return result;
}

// Range-based loop, i.e. for(:)
unsigned long testFor(std::vector<int> &array){
    unsigned long result = 0;
    for (const int &element : array)
        result += element;
    return result;
}
int _tmain(int argc, _TCHAR* argv[])
{
    std::vector<int> testingArray;

    init(testingArray);

    //Warm up
    std::cout << "warming up \n";
    testForLoop(testingArray);
    testFor(testingArray);
    testForLoop(testingArray);
    testFor(testingArray);
    testForLoop(testingArray);
    testFor(testingArray);
    std::cout << "starting \n";

    auto start = std::chrono::high_resolution_clock::now();
    testForLoop(testingArray);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "ForLoop took: " <<  std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;


    start = std::chrono::high_resolution_clock::now();
    testFor(testingArray);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "For---- took: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;

    system("pause");
    return 0;

}

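【Comments】:

Besides posting the code, please make sure your benchmark runs with optimization enabled.
Yes, the range-based for loop uses iterators, just as you could in a non-range-based for loop. See e.g. this reference for a typical implementation. If you don't use iterators in the other for loop, the two tests are not equivalent.
You are benchmarking unoptimized code (/Od, /D_DEBUG, etc.). That is like determining who runs fastest by measuring who is best at reading a map. Turn on optimization and try again.
With optimization on, the loops are optimized away and both take 0 nanoseconds to complete.
I used the functions' return values to give them a purpose so they would not be removed, and now for(;;) takes 6006000 ns and for(:) takes 8001200 ns.

Tags: c++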

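【Answer 1】:

This answer is a guess; it depends on the exact code and optimizations used. The underlying platform can also change how the code behaves.

There are basically two "low-level" ways to manage an iteration: one based on a "reassignable pointer", the other based on a "constant pointer and an offset".

In pseudocode: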

loop { *a = *b; ++a; ++b; }

versus

loop { a[i] = b[i]; ++i; }

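Depending on the processor architecture, the two behave differently in terms of register usage, address locality and caching: the first performs two additions with a constant held in memory, the second performs two additions with a register plus one register increment. (Both also perform a memory copy.)

On the x86 platform the second form is preferable, because it makes fewer memory accesses and uses instructions that require fewer memory fetches.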
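
The two pseudocode forms above can be written out as real C++ to make the difference concrete. This is only a minimal sketch: the function names `copy_by_pointer`, `copy_by_index` and `both_forms_agree` are made up for illustration, and which form a given compiler turns into faster machine code is exactly the part this answer is guessing about.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Form 1: two "reassignable" pointers, both advanced on every iteration.
void copy_by_pointer(const int* src, int* dst, std::size_t n) {
    const int* end = src + n;
    while (src != end) {
        *dst = *src;
        ++src;
        ++dst;
    }
}

// Form 2: two fixed base pointers plus one shared offset.
void copy_by_index(const int* src, int* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Hypothetical self-check: both forms must produce the same copy.
bool both_forms_agree(const std::vector<int>& input) {
    std::vector<int> a(input.size());
    std::vector<int> b(input.size());
    copy_by_pointer(input.data(), a.data(), input.size());
    copy_by_index(input.data(), b.data(), input.size());
    assert(a == input);
    return a == b;
}
```

Both functions do the same work; only the way the addresses are computed differs, which is what may or may not matter after optimization.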

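Now, an iterator-based loop applied to a vector (whose iterators wrap pointers) leads to the first form, while a traditional index-based loop leads to the second form.

And for(a: v) { .... } is the same as for(auto i=v.begin(); i!=v.end(); ++i) { auto& a=*i; ... }

It works with any kind of container (including ones that are not contiguous in memory), but it cannot be reduced to the index-based form, unless the compiler's optimizer is good enough to discover that the iterator is really a pointer moving by a constant increment.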
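
To make the equivalence above concrete, here is a minimal sketch of the same sum written three ways. The function names are hypothetical and chosen only for this illustration. One detail worth noting: the expansion defined by the language evaluates the end iterator once before the loop rather than calling `v.end()` on every iteration, so the hand-written version below caches it as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Range-based for, as in the question's testFor.
unsigned long sum_range_for(const std::vector<int>& v) {
    unsigned long result = 0;
    for (const int& element : v)
        result += element;
    return result;
}

// Roughly what the range-based for expands to: an iterator loop
// with the end iterator evaluated once up front.
unsigned long sum_iterators(const std::vector<int>& v) {
    unsigned long result = 0;
    for (auto it = v.begin(), last = v.end(); it != last; ++it) {
        const int& element = *it;
        result += element;
    }
    return result;
}

// Index-based loop, as in the question's testForLoop.
unsigned long sum_index(const std::vector<int>& v) {
    unsigned long result = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        result += v[i];
    return result;
}

// Hypothetical self-check: all three loops must yield the same sum.
void check_loops_agree(const std::vector<int>& v) {
    assert(sum_range_for(v) == sum_iterators(v));
    assert(sum_iterators(v) == sum_index(v));
}
```

The first two functions are the same loop in different spelling; the third only becomes equivalent machine code if the optimizer sees through the iterator.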

【Comments】:

【Answer 2】:

To make sure the tests were not optimized away, I printed the results:

 auto x = testForLoop(......

 // ^^^
 ......nd - start).count() << "  R: " << x << std::endl;

                          //  ^^^^^^^^^^^^^^^^
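
Since the fragments above are abbreviated, here is a hedged sketch of what a complete version of that idea might look like. `Timed`, `time_once` and `report` are hypothetical helpers, not part of the original post, and `std::chrono::steady_clock` is swapped in for the question's `high_resolution_clock` because it is guaranteed to be monotonic. Printing the sum keeps the compiler from discarding the loop, although it does not by itself guarantee a rigorous measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical result type: elapsed time plus the value that was computed.
struct Timed {
    long long nanoseconds;
    unsigned long result;
};

// Times a single call of fn(data) and keeps its return value.
template <typename Fn>
Timed time_once(Fn fn, const std::vector<int>& data) {
    auto start = std::chrono::steady_clock::now();
    unsigned long result = fn(data);
    auto end = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    return Timed{static_cast<long long>(ns), result};
}

unsigned long sum_index(const std::vector<int>& v) {
    unsigned long result = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        result += v[i];
    return result;
}

unsigned long sum_range(const std::vector<int>& v) {
    unsigned long result = 0;
    for (const int& element : v)
        result += element;
    return result;
}

// Prints timing and result for both loops; using the result in the
// output means the summation cannot simply be removed as dead code.
void report(const std::vector<int>& data) {
    Timed a = time_once(sum_index, data);
    Timed b = time_once(sum_range, data);
    assert(a.result == b.result);
    std::cout << "ForLoop took: " << a.nanoseconds << "  R: " << a.result << '\n';
    std::cout << "For---- took: " << b.nanoseconds << "  R: " << b.result << '\n';
}
```

Calling `report` on the vector built by `init` prints two lines in the same shape as the output below; the timings will of course differ by machine and compiler flags.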

Normal mode (no optimization): for(:) runs at roughly half the speed

> g++ -std=c++11 v.cpp
> ./a.out
warming up
starting
ForLoop took: 33262788  R: 10739647121123056
For---- took: 51263111   R: 10739647121123056

Optimized: almost identical

> g++ -O3 -std=c++11 v.cpp
> ./a.out
warming up
starting
ForLoop took: 4861314  R: 10739647121123056
For---- took: 4997957   R: 10739647121123056

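【Comments】:

【Answer 3】:

The standard says nothing about performance or implementation. Both loops should work correctly, and under normal circumstances their performance should be the same. Nobody can say why it is so slow in MSVC++, short of claiming that it is a bug or a poor implementation. Perhaps you should fix your optimization settings.

I have tested your code with MSVC++, GCC and Clang.

GCC output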

ForLoop took: 7879773
For---- took: 5786831

Clang output

ForLoop took: 6537441
For---- took: 6743614

and MSVC++ output

ForLoop took: 77786200
For---- took: 249612200

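GCC and Clang both give reasonable results, and the two loops are close to each other, as expected. The MSVC++ result, however, is odd and unrealistic. I would call it a bug or a regression. Alternatively, your compiler configuration is poor; try other optimization settings.

【Comments】:

Looking at the assembly generated by VS2013, the testForLoop implementation is nicely unrolled SSE code, whereas the testFor code is just a simple add loop, which is why testForLoop is faster. Turning SSE off makes the assembly and the run time of the two almost identical.

【Answer 4】:

If you are using: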

for ( auto x : ... )

then each x is a copy. This might have less overhead:

for ( const auto & x : ... )
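
The difference between the two spellings can be made visible with an element type that counts its own copies. This is only an illustrative sketch: `Counted`, `copies_by_value`, `copies_by_reference` and `check_copy_counts` are hypothetical names introduced here, not anything from the question.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct Counted {
    static int copies;
    int value = 0;
    Counted() = default;
    Counted(const Counted& other) : value(other.value) { ++copies; }
    Counted& operator=(const Counted&) = default;
};
int Counted::copies = 0;

// Iterating by value: every element is copied into x.
int copies_by_value(const std::vector<Counted>& items) {
    Counted::copies = 0;
    long sum = 0;
    for (auto x : items)
        sum += x.value;
    (void)sum;  // the sum itself is not interesting here
    return Counted::copies;
}

// Iterating by const reference: x refers to the element in place.
int copies_by_reference(const std::vector<Counted>& items) {
    Counted::copies = 0;
    long sum = 0;
    for (const auto& x : items)
        sum += x.value;
    (void)sum;
    return Counted::copies;
}

// Hypothetical self-check on a three-element vector.
void check_copy_counts() {
    std::vector<Counted> items(3);  // default-constructed, no copies made
    assert(copies_by_value(items) == 3);
    assert(copies_by_reference(items) == 0);
}
```

For a plain `int` element, as in the question, the copy is trivially cheap, so this is unlikely to explain the measured gap there; it matters more for element types that are expensive to copy.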

【Comments】:

I am using for (const int &x : )