Why is there performance variation using default constructor "{}" instead of "= default"? (Answers)

【Question title】: Why is there performance variation using default constructor "{}" instead of "= default"?
【Posted】: 2019-12-10 06:34:01
【Question description】:

I recently noticed that I was taking a performance hit because I had declared a default constructor like this:

Foo() = default;

instead of

Foo() {}

(FYI, I need to declare it explicitly because I also have a variadic constructor, which would otherwise suppress the default constructor.)
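The asker's real class is not shown, so the following is only a minimal sketch of the situation described. The names `VariadicOnly` and `VariadicPlusDefault` are hypothetical, and the variadic constructor is assumed to require at least one argument. It illustrates that declaring any constructor suppresses the implicit default constructor, which is why an explicit `= default` (or `{}`) is needed:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the asker's class. Assumption: the variadic
// constructor needs at least one argument (the real one is not shown).
struct VariadicOnly
{
    int n_args;

    template <typename First, typename... Rest>
    VariadicOnly(First&&, Rest&&...) : n_args(1 + sizeof...(Rest)) {}
};

struct VariadicPlusDefault
{
    int n_args = 0;

    VariadicPlusDefault() = default;

    template <typename First, typename... Rest>
    VariadicPlusDefault(First&&, Rest&&...) : n_args(1 + sizeof...(Rest)) {}
};

// Declaring any constructor suppresses the implicit default constructor.
static_assert(!std::is_default_constructible<VariadicOnly>::value,
              "no default constructor available");
static_assert(std::is_default_constructible<VariadicPlusDefault>::value,
              "default constructor was declared explicitly");

// Small runtime check of which constructor gets picked.
inline void check_variadic_demo()
{
    VariadicPlusDefault none;                // defaulted constructor
    VariadicPlusDefault three(1, 2.0, 'c');  // variadic constructor
    assert(none.n_args == 0);
    assert(three.n_args == 3);
}
```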

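This seemed strange to me, because I thought these two lines of code were equivalent (well, as long as a default constructor is possible. If a default constructor is not possible, the second line produces an error and the first implicitly deletes the default constructor. Not my case!).

OK, so I wrote a small tester. The results vary from compiler to compiler, but under certain settings I consistently find that one is faster than the other: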

#include <chrono>

template <typename T>
double TimeDefaultConstructor (int n_iterations)
{
    auto start_time = std::chrono::system_clock::now();

    for (int i = 0; i < n_iterations; ++i)
        T t;

    auto end_time = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end_time - start_time;

    return elapsed_seconds.count();
}

template <typename T, typename S>
double CompareDefaultConstructors (int n_comparisons, int n_iterations)
{
    int n_comparisons_with_T_faster = 0;

    for (int i = 0; i < n_comparisons; ++i)
    {
        double time_for_T = TimeDefaultConstructor<T>(n_iterations);
        double time_for_S = TimeDefaultConstructor<S>(n_iterations);

        if (time_for_T < time_for_S)    
            ++n_comparisons_with_T_faster;  
    }

    return (double) n_comparisons_with_T_faster / n_comparisons;
}


#include <vector>

template <typename T>
struct Foo
{
    std::vector<T> data_;

    Foo() = default;
};

template <typename T>
struct Bar
{
    std::vector<T> data_;

    Bar() {};
};

#include <iostream>
#include <typeinfo>  // required for typeid

int main ()
{
    int n_comparisons = 10000;
    int n_iterations = 10000;

    typedef int T;

    double result = CompareDefaultConstructors<Foo<T>,Bar<T>> (n_comparisons, n_iterations);

    std::cout << "With " << n_comparisons << " comparisons of " << n_iterations
        << " iterations of the default constructor, Foo<" << typeid(T).name() << "> was faster than Bar<" << typeid(T).name() << "> "
        << result*100 << "% of the time" << std::endl;

    std::cout << "swapping orientation:" << std::endl;

    result = CompareDefaultConstructors<Bar<T>,Foo<T>> (n_comparisons, n_iterations);

    std::cout << "With " << n_comparisons << " comparisons of " << n_iterations
        << " iterations of the default constructor, Bar<" << typeid(T).name() << "> was faster than Foo<" << typeid(T).name() << "> "
        << result*100 << "% of the time" << std::endl;

    return 0;
}

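Using the above program with g++ -std=c++11, I consistently get output similar to the following:

With 10000 comparisons of 10000 iterations of the default constructor, Foo<i> was faster than Bar<i> 4.69% of the time
swapping orientation:
With 10000 comparisons of 10000 iterations of the default constructor, Bar<i> was faster than Foo<i> 96.23% of the time

Changing the compiler settings seems to change the result, sometimes flipping it entirely. What I can't understand is why it matters at all.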

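【Comments】:

Using system_clock for timing is not a good idea.
@NicolBolas, I'm not interested in the accuracy of the timing. I'm interested in the fact that Foo can consistently perform better than Bar (or vice versa). The clock is good enough to show that.
Did you test an optimized build? Otherwise, your results are meaningless.
Unoptimized compilation is not designed for performance. Measuring the performance of unoptimized code is therefore a useless form of entertainment.
@n.'pronouns'm. I don't think you understand what I'm trying to do. I thought the two ways of declaring a default constructor in C++ were identical, but I see a performance difference even without the optimizer. At this stage I'm not really interested in performance as such, but the difference suggests to me that the two default constructors are not identical.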

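Tags: c++ performance constructor compiler-optimization default-constructor

【Solution 1】:

This benchmark does not measure what it is supposed to measure. Replace Bar() {}; with Bar() = default; so that Foo and Bar are identical, and you still get the same kind of result: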

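With 10000 comparisons of 10000 iterations of the default constructor, Foo<i> was faster than Bar<i> 69.89% of the time
swapping orientation:
With 10000 comparisons of 10000 iterations of the default constructor, Bar<i> was faster than Foo<i> 29.9% of the time

This is a vivid demonstration that you are measuring not the constructors, but something else.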
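The control experiment described above can be sketched as follows. This is an illustration, not the answerer's code: `ControlA`, `ControlB`, and the function names are hypothetical, and `std::chrono::steady_clock` is swapped in for `system_clock` as the comments on the question suggest. Because the two types differ only in name, any win rate far from 50% would point at the harness (ordering, noise) and not at the constructors:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Time n_iterations default constructions with a monotonic clock.
template <typename T>
double time_default_ctor(int n_iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iterations; ++i)
    {
        T t;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Fraction of rounds in which First timed strictly faster than Second.
template <typename First, typename Second>
double fraction_first_faster(int n_rounds, int n_iterations)
{
    assert(n_rounds > 0);  // avoid dividing by zero below
    int wins = 0;
    for (int i = 0; i < n_rounds; ++i)
    {
        double a = time_default_ctor<First>(n_iterations);
        double b = time_default_ctor<Second>(n_iterations);
        if (a < b)
            ++wins;
    }
    return static_cast<double>(wins) / n_rounds;
}

// Hypothetical control pair: two types that differ only in name.
struct ControlA { std::vector<int> data_; ControlA() = default; };
struct ControlB { std::vector<int> data_; ControlB() = default; };

// Run the A/A comparison; the returned value is expected to be noisy.
inline double run_control(int n_rounds, int n_iterations)
{
    return fraction_first_faster<ControlA, ControlB>(n_rounds, n_iterations);
}
```

Calling `run_control(10000, 10000)` mirrors the parameters used in the question. The exact fraction depends on the machine and build flags, so no expected value is given here.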

With -O1 optimization enabled, the for loop containing T t; degenerates into¹:

        test    ebx, ebx
        jle     .L3
        mov     eax, 0
.L4:
        add     eax, 1
        cmp     ebx, eax
        jne     .L4
.L3:

for both Foo and Bar. That is, it becomes a trivial for (int i = 0; i < n_iterations; ++i); loop.

When you enable -O2 or -O3, it is optimized away completely.
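If the goal is to time construction in an optimized build, the object has to be made observable to the optimizer somehow. The sketch below shows one commonly used approach and is not something the answer itself proposes. The names `escape` and `construct_and_escape` are hypothetical, the empty inline-assembly statement is a GCC/Clang extension, and the fallback branch is only a weaker approximation:

```cpp
#include <cassert>
#include <vector>

// Tell the optimizer that the object at p may be read or modified elsewhere.
// Assumption: GCC or Clang extended asm is available; otherwise fall back
// to a volatile store of the pointer, which is a weaker hint.
inline void escape(void* p)
{
#if defined(__GNUC__) || defined(__clang__)
    __asm__ __volatile__("" : : "g"(p) : "memory");
#else
    static void* volatile sink;
    sink = p;
#endif
}

// Construct T n_iterations times, letting each object's address escape so
// the loop body is less likely to be removed as dead code.
template <typename T>
int construct_and_escape(int n_iterations)
{
    assert(n_iterations >= 0);
    int constructed = 0;
    for (int i = 0; i < n_iterations; ++i)
    {
        T t;
        escape(&t);
        ++constructed;
    }
    return constructed;
}
```

Whether the construction really survives optimization still has to be confirmed by looking at the generated assembly, as this answer does.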

Without optimization (-O0), you get the following assembly:

        mov     DWORD PTR [rbp-4], 0
.L35:
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, DWORD PTR [rbp-68]
        jge     .L34
        lea     rax, [rbp-64]
        mov     rdi, rax
        call    Foo<int>::Foo()
        lea     rax, [rbp-64]
        mov     rdi, rax
        call    Foo<int>::~Foo()
        add     DWORD PTR [rbp-4], 1
        jmp     .L35
.L34:

The same holds for Bar, with Foo replaced by Bar.

Now let's look at the constructors:

Foo<int>::Foo()
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        mov     rdi, rax
        call    std::vector<int, std::allocator<int> >::vector()
        nop
        leave
        ret

and

Bar<int>::Bar()
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        mov     rdi, rax
        call    std::vector<int, std::allocator<int> >::vector()
        nop
        leave
        ret

As you can see, they are identical too.

¹ GCC 8.3

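【Comments】:

@Elliott-ReinstateMonica The optimizer does not optimize the two constructors differently. It optimizes them away completely. The answer shows that the generated code is identical at every optimization level. The problem with your measurement is that noise dominates the signal.
@Elliott-ReinstateMonica, they were identical before and they are identical after. On modern CPUs, even identical assembly code can have different timings.
@Evg, thanks a lot. OK. After two hours of looking at this, it seems my initial belief about the two constructors was correct: they are identical. In future I should learn to use the assembly to answer questions like this.
@Elliott-ReinstateMonica, https://godbolt.org will become your good friend.
@Elliott: FYI: they are not identical. A = default constructor may be trivial (depending on the member subobjects), whereas a {} constructor will never be trivial.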

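【Solution 2】:

Foo() = default; and Foo() {}; are different. The former is a trivial default constructor, while the latter is a custom version of the default constructor that does nothing beyond the default stuff.

This can be observed via type_traits. Such a change can affect which allocation/construction routines are selected during template function resolution, so that completely different code ends up being used.

While this should not matter for default constructors, it can change a lot for copy constructors/assignment. So = default is preferred.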

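【Comments】:

Thanks. What do you mean by "This can be observed via type_traits"? How?
Foo() = default; is not trivial.
@Elliott-ReinstateMonica There are functions that test various properties of types, e.g. std::is_default_constructible or std::is_trivially_copyable. Some of these tests can identify the difference.
@ALX23z: I think he is asking a more specific question: not what type traits do in general, but which specific type trait can detect the difference between an empty default ctor and an explicitly defaulted default ctor for this class. In this case, with either of them the class is default constructible but not trivially so.
"The latter is a custom version of the default constructor that does nothing beyond the default stuff": this is interesting but also very unclear. Can you specify what the default stuff is?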
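The point raised in these comments can be checked directly. The sketch below reuses the Foo and Bar definitions from the question and adds one small helper, `check_both_start_empty`, which is a hypothetical name used only for illustration. For these two classes, the traits named in the comments give the same answer for both, because the std::vector member already has a non-trivial default constructor:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

template <typename T>
struct Foo
{
    std::vector<T> data_;
    Foo() = default;
};

template <typename T>
struct Bar
{
    std::vector<T> data_;
    Bar() {}
};

// Both classes are default constructible...
static_assert(std::is_default_constructible<Foo<int>>::value, "");
static_assert(std::is_default_constructible<Bar<int>>::value, "");

// ...and neither is trivially default constructible, because the
// std::vector member's own default constructor is not trivial.
static_assert(!std::is_trivially_default_constructible<Foo<int>>::value, "");
static_assert(!std::is_trivially_default_constructible<Bar<int>>::value, "");

// Neither is trivially copyable either, again because of std::vector.
static_assert(!std::is_trivially_copyable<Foo<int>>::value, "");
static_assert(!std::is_trivially_copyable<Bar<int>>::value, "");

// Both constructors leave the vector member empty.
inline void check_both_start_empty()
{
    Foo<int> f;
    Bar<int> b;
    assert(f.data_.empty());
    assert(b.data_.empty());
}
```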

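【Solution 3】:

I suspect the speed difference you think you are seeing is mostly an artifact of the timing and not a real difference.

To see what was being generated, I simplified your code a little, leaving only the following: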

#include <vector>

template <typename T>
struct Foo
{
    std::vector<T> data_;

    Foo() = default;
};

template <typename T>
struct Bar
{
    std::vector<T> data_;

    Bar() {};
};

int main() { 
    Foo<int> f;

    Bar<int> b;
}

I then put that on Godbolt to make the generated code easy to view.

gcc 9.2 seems to generate identical code for the two ctors, looking like this in both cases:

push    rbp
mov     rbp, rsp
sub     rsp, 16
mov     QWORD PTR [rbp-8], rdi
mov     rax, QWORD PTR [rbp-8]
mov     rdi, rax
call    std::vector<int, std::allocator<int> >::vector() [complete object constructor]
nop
leave
ret

Clang generates slightly different code, but (again) identical for the two classes:

push    rbp
mov     rbp, rsp
sub     rsp, 16
mov     qword ptr [rbp - 8], rdi
mov     rdi, qword ptr [rbp - 8]
call    std::vector<int, std::allocator<int> >::vector() [base object constructor]
add     rsp, 16
pop     rbp
ret

Intel icc is much the same, generating the following code for both classes:

push      rbp                                           #8.5
mov       rbp, rsp                                      #8.5
sub       rsp, 16                                       #8.5
mov       QWORD PTR [-16+rbp], rdi                      #8.5
mov       rax, QWORD PTR [-16+rbp]                      #8.5
mov       rdi, rax                                      #8.5
call      std::vector<int, std::allocator<int> >::vector() [complete object constructor]                      #8.5
leave                                                   #8.5
ret

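Although I agree with the others that looking at performance with optimization disabled is of little use, in this case it seems that even disabling optimization is not enough (at least for these three compilers) to get different code for constructing objects of the two classes. I would not be terribly surprised if some compiler and/or optimization setting did produce different results, but I'm afraid I lack the ambition to spend more time looking for it.

【Comments】:

【Solution 4】: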

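Foo() = default; is a trivial constructor.

Foo() {} is a user-defined constructor, and by definition a user-defined constructor is never trivial, even if it is empty.

See also: Trivial default constructor and std::is_trivial.

With compiler optimizations enabled, a trivial constructor can be expected to be faster than a user-provided one.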
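The distinction drawn in this answer does show up for classes whose members are all trivially constructible, which is not the case for the question's Foo and Bar with their std::vector member. The sketch below uses two hypothetical scalar-only classes, `DefaultedCtor` and `EmptyBodyCtor`, to illustrate it. It also shows one observable consequence, namely that value-initialization zeroes the member only when the default constructor is not user-provided:

```cpp
#include <cassert>
#include <new>
#include <type_traits>

// Hypothetical scalar-only classes (no std::vector member).
struct DefaultedCtor { int x; DefaultedCtor() = default; };
struct EmptyBodyCtor { int x; EmptyBodyCtor() {} };

static_assert(std::is_trivially_default_constructible<DefaultedCtor>::value, "");
static_assert(!std::is_trivially_default_constructible<EmptyBodyCtor>::value, "");
static_assert(std::is_trivial<DefaultedCtor>::value, "");
static_assert(!std::is_trivial<EmptyBodyCtor>::value, "");

// Value-initialization zero-initializes the object when the default
// constructor is not user-provided. With EmptyBodyCtor the member would be
// left indeterminate instead, so it is deliberately not read here.
inline int value_initialized_x()
{
    alignas(DefaultedCtor) unsigned char storage[sizeof(DefaultedCtor)];
    for (unsigned char& byte : storage)
        byte = 0xFF;                                   // poison the storage
    DefaultedCtor* p = new (storage) DefaultedCtor();  // value-initialization
    int x = p->x;
    assert(x == 0);
    return x;
}
```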

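【Comments】:

Interesting. In fact I see the biggest difference in performance when my optimizer is disabled (because the objects are actually being made). I'll have a read and then come back here. Thanks.
Foo() = default; is not trivial at all. It is defaulted. static_assert(!std::is_trivially_default_constructible_v<Foo<int>>);.
@Elliot Trying to reason about performance when the optimizer is not enabled is usually pointless. Both the compiler and the standard library insert all sorts of debug checks, and sometimes do extra work to initialize variables to zero, plus many other things that reduce performance and tilt the scales as to what is fast and what is slow. Don't waste time measuring debug builds; they do not accurately reflect the performance of release builds.
@JesperJuhl Thanks. My logic was that I wanted to experiment with performance on the simplest possible problem, so I didn't want the compiler to throw my code away during optimization because it recognized that it was useless. I also thought that without optimization it would be easier to understand what the compiler was doing...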