控制台输出顺序减慢多线程程序答案

【问题标题】：Console output order slows down multi-threaded program控制台输出顺序减慢多线程程序
【发布时间】：2013-08-04 16:48:34
【问题描述】：

编译以下代码时

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for(int i=0; i<10; i++)
    {
        double d=0.0;
        for(size_t cnt=0; cnt<200000000; cnt++) d += 1.23456;

        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        //        std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());

    auto t_begin = std::chrono::high_resolution_clock::now();

    for(auto& t : all_t) t = std::thread{task};
    for(auto& t : all_t) t.join();

    auto t_end = std::chrono::high_resolution_clock::now();

    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}

在 MinGW 4.8.1 下，在我的盒子上执行大约需要 2.5 秒。这大约是仅单线程执行task 函数所需的时间。

但是，当我取消注释中间的行并因此注释掉之前的行时（也就是说，当我交换将d 和"(Help)" 写入std::cout 的顺序时）整个事情现在需要8-9 秒。

解释是什么？

我再次测试，发现我只有 MinGW-build x32-4.8.1-win32-dwarf-rev3 的问题，而 MinGW build x64-4.8.1-posix-seh-rev3 没有问题。我有一台 64 位的机器。使用 64 位编译器，两个版本都需要三秒钟。但是，使用 32 位编译器，问题仍然存在（并且不是由于发布/调试版本混淆）。

【问题讨论】：

会不会是因为某种原因，您从发布版本中获得了快的时间，而从调试版本中获得了慢的时间？与 gcc 4.7.2/4.8.1、clang 3.2/3.3（所有 Linux）的交换我没有得到任何显着差异。但是对于发布版本，我大约需要 2 秒，而对于调试，我大约需要 8 秒。与您看到的比率明显相似（在 Windows 上，在其他机器上）。
不，这不是调试/发布问题；请看我上面的编辑。
32 位的 x87 fpu 和 64 位的 SSE2 似乎有所不同。 32 位编译器可能会生成（慢）double 操作的 x87 指令代码。
看起来 gcc 4.8.1 为内部循环生成了非常不同的程序集。在一种情况下，它只是将常数添加到 xmm0 中。在另一种情况下，中间结果在堆栈上累加，除了 addd 之外，还会生成两条额外的 movsd 指令。

标签： c++11 concurrency mutex stdthread

【解决方案1】：

它与多线程无关。这是循环优化的问题。我重新排列了原始代码，以获得一些简单的东西来证明这个问题：

#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for(int i=0; i<2; i++)
    {
        double d=0.0;
        for(int j=0; j<100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}

编译和执行时：

g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

输出是：

a123456
a123456
Took : 931
123456a
123456a
Took : 373

大部分时间上的差异可以通过为内部循环生成的汇编代码来解释：快速情况直接在 xmm0 中累积，而慢速情况在 xmm1 中累积 - 导致 2 个额外的 movsd 指令。

现在，当使用“-ftree-loop-linear”选项编译时：

g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

输出变成：

a123456
a123456
Took : 340
123456a
123456a
Took : 346

【讨论】：