【Question title】: std stack performance issues [closed]
【Posted】: 2012-09-24 03:13:29
【Question】:

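Recently I have been running some performance benchmarks comparing std::stack<int, std::vector<int>> with my own simple stack implementation (which uses preallocated memory). I have run into some strange behavior.

The first thing I want to ask about is this line in the stack benchmark code: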

//  std::vector<int> magicVector(10);

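When I uncomment this line, performance improves by about 17% (benchmark time drops from 6.5 to 5.4 seconds). But the line should have no effect on the rest of the program, since it does not modify any other members. It also makes no difference whether it is a vector of int or a vector of double...

The second thing I want to ask about is the large performance difference between my stack implementation and std::stack. I was told that std::stack should be about as fast as my stack, but the results show that my "FastStack" is twice as fast.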

Results (with the performance-boosting line uncommented):
stack 5.38979
stack 5.34406
stack 5.32404
stack 5.30519
FastStack 2.59635
FastStack 2.59204
FastStack 2.59713
FastStack 2.64814

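These results come from a release build in VS2010 with /O2, /Ot, /Ob2 and the other default optimizations. My CPU is an Intel i5 3570k at its default clock (3.6 GHz for one thread).

I put all the code in a single file so that anyone can easily test it.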

#define _SECURE_SCL 0

#include <iostream>
#include <vector>
#include <stack>
#include <Windows.h>

using namespace std;

//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    High Resolution Timer
//---------------------------------------------------------------------------------

class HRTimer
{
public:
    HRTimer();
    double GetFrequency(void);
    void Start(void) ;
    double Stop(void);
    double GetTime();

private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HRTimer::HRTimer()
{
    frequency = this->GetFrequency();
}

double HRTimer::GetFrequency(void)
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return proc_freq.QuadPart;
}

void HRTimer::Start(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HRTimer::Stop(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
} 

double HRTimer::GetTime()
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}

//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Should be faster than std::stack
//---------------------------------------------------------------------------------

template <class T>

class FastStack
{
public:
    T* st;
    int allocationSize;
    int lastIndex;

public:
    FastStack(int stackSize);
    ~FastStack();

    inline void resize(int newSize);
    inline void push(T x);
    inline void pop();
    inline T getAndRemove();
    inline T getLast();
    inline void clear();
};

template <class T>
FastStack<T>::FastStack( int stackSize )
{
    st = NULL;
    this->allocationSize = stackSize;
    st = new T[stackSize];
    lastIndex = -1;
}

template <class T>
FastStack<T>::~FastStack()
{
    delete [] st;
}

template <class T>
void FastStack<T>::clear()
{
    lastIndex = -1;
}

template <class T>
T FastStack<T>::getLast()
{
    return st[lastIndex];
}

template <class T>
T FastStack<T>::getAndRemove()
{
    return st[lastIndex--];
}

template <class T>
void FastStack<T>::pop()
{
    --lastIndex;
}

template <class T>
void FastStack<T>::push( T x )
{
    st[++lastIndex] = x;
}

template <class T>
void FastStack<T>::resize( int newSize )
{
    if (st != NULL)
        delete [] st;
    st = new T[newSize];
}
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Benchmark of std::stack and FastStack
//---------------------------------------------------------------------------------


int main(int argc, char *argv[])
{
#if 1
    for (int it = 0; it < 4; it++)
    {
        std::stack<int, std::vector<int>> bStack;
        int x;

        for (int i = 0; i < 100; i++)   // after this two loops, bStack's capacity will be 141 so there will be no more reallocating
            bStack.push(i);
        for (int i = 0; i < 100; i++)
            bStack.pop();
    //  std::vector<int> magicVector(10);           // when you uncomment this line, performance will magically rise about 18%

        HRTimer timer;
        timer.Start();

        for (int i = 0; i < 2000000000; i++)
        {
            bStack.push(i);
            x = bStack.top();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    bStack.pop();
        }

        double totalTime = timer.Stop();
        cout << "stack " << totalTime << endl;
    }
#endif

    //------------------------------------------------------------------------------------

#if 1
    for (int it = 0; it < 4; it++)
    {
        FastStack<int> fstack(200);
        int x;

        HRTimer timer;
        timer.Start();

        for (int i = 0; i < 2000000000; i++)
        {
            fstack.push(i);
            x = fstack.getLast();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    fstack.pop();
        }

        double totalTime = timer.Stop();
        cout << "FastStack " << totalTime << endl;
    }
#endif

    cout << "Done";
    cin.get();
    return 0;
}

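EDIT: Since everyone is talking about how bad my stack implementation is, I want to set things straight. I wrote that stack in a few minutes and implemented only the few features I currently need. It was never meant to be a replacement for std::stack :) or to be safe to use in all cases. The only goal was maximum speed and correct results. Sorry for the misunderstanding... I just want to know a few answers...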

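【Comments on the question】:

About 4 hours after this was pointed out to you, you amended your question with a disclaimer about your faulty implementation. That was certainly enough time to fix the implementation, make all the comments pointing out its flaws obsolete, and bring the discussion back to the performance issues. You decided to play elsewhere, so I will vote to close this question as a "silly benchmark attempt". Oh wait, that doesn't exist. So "not constructive" it will be: "We expect answers to be supported by facts, references, or specific expertise." Fits well, I think.
@sbi You have already voted to close the question and now it is closed, so calm down :)
@sbi Why should I change that implementation? Even this "broken" version meets my needs, and my solution that uses it works perfectly well, without any exceptions, but with a noticeable speedup. It wasn't meant to be perfect, it was made to be fast.
Just because you haven't run into the bugs in your current code is not a good reason to ignore them. (But learning that takes painful experience.) Also, one would think that making the comments pointing out its flaws obsolete and bringing the discussion back to the performance issues would be reason enough. Anyway, I mainly wanted to explain why I think this deserves closing, since two people followed my reasoning. Now that it lacks all discussion of the question itself, it may not be so obvious that this question invites debate, arguments, and extended discussion.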

Tags: c++ windows performance stack

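【Answer 1】:

Your method implementations are all broken. Ignoring the copy constructor and other missing operations, your push invokes UB if you push too much, and your resize is obviously broken: it doesn't copy the previous data and it isn't exception safe. Your push isn't exception safe, you invoke too many copies, your getAndRemove isn't exception safe, you don't destroy popped elements, you don't construct new elements properly but only assign to them, you needlessly default-construct on creation, and there are probably more problems that I haven't found.

Basically, your class is extremely unsafe in every conceivable way, destroys the user's data at the drop of a hat, calls all the wrong functions on T, and will go crying in a corner the moment an exception is thrown anywhere.

It's a giant pile of bad, and the fact that it's "faster" than std::stack is entirely irrelevant, since all you've proven is that if you don't have to meet the requirements, you can go as fast as you like, which we all know.

Fundamentally, as sbi said, you clearly don't understand the semantics of std::stack, nor important C++ aspects like exception safety, and the ways in which your code fails to work correctly are exactly what make it execute faster. You have a long way to go, my friend.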
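
To make two of the points above concrete (resize losing the previous data, and resize not being exception safe), here is a minimal sketch of one way such a resize could be written. It is not the OP's class: `IntBuffer` and `topAfterGrowth` are hypothetical names introduced only for this illustration, and the buffer is restricted to int so that copying elements cannot throw. It assumes a C++11 compiler.

```cpp
#include <algorithm>  // std::copy, std::min
#include <cassert>
#include <cstddef>

// Hypothetical int-only buffer, for illustration only.
class IntBuffer {
public:
    explicit IntBuffer(std::size_t capacity)
        : data_(new int[capacity]), capacity_(capacity), size_(0) {}
    ~IntBuffer() { delete[] data_; }

    // Copying is disabled to keep the sketch short.
    IntBuffer(const IntBuffer&) = delete;
    IntBuffer& operator=(const IntBuffer&) = delete;

    void resize(std::size_t newCapacity) {
        // Allocate first: if new throws, *this has not been modified.
        int* fresh = new int[newCapacity];
        const std::size_t keep = std::min(size_, newCapacity);
        std::copy(data_, data_ + keep, fresh);  // keeps the old elements
        delete[] data_;
        data_ = fresh;
        capacity_ = newCapacity;
        size_ = keep;
    }

    void push(int x) {
        // Grow instead of writing past the end of the buffer.
        if (size_ == capacity_) resize(capacity_ == 0 ? 1 : 2 * capacity_);
        data_[size_++] = x;
    }

    int top() const {
        assert(size_ > 0);
        return data_[size_ - 1];
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }

private:
    int* data_;
    std::size_t capacity_;
    std::size_t size_;
};

// Pushes five values into a buffer that starts with room for two,
// so the buffer has to grow twice along the way.
int topAfterGrowth() {
    IntBuffer b(2);
    for (int i = 0; i < 5; ++i) b.push(i);
    return b.top();
}
```

The extra branch in push is part of what the OP's version avoids paying for, which is the trade-off this answer is pointing at.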

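【Comments】:

+1 Best deconstruction of the OP's code I've read. :P
@klerik Duh, your stack does different things than the std stack (basically, the difference is that yours falls apart in the slightest breeze, while the standard stack just works). That's why they don't have the same performance characteristics. Someone once said that it's easy to make a fast program that outputs garbage.
-1 The safety concerns are irrelevant. They are not what was asked about. Safety does not affect the performance of the code.
It does. Not having to copy on resize saves cycles. Not having to check bounds saves cycles. He is saving time simply by not implementing the same functionality.
The lack of safety checks does matter. It's like asking why a cake without eggs or butter doesn't taste the same.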

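【Answer 2】:

Contrary to std::stack using std::vector, your stack doesn't reallocate when it runs out of space, it just blows up the planet. Allocation, however, is a huge drain on performance, so skipping it will certainly gain performance.

However, in your place I would grab one of the mature static_vector implementations floating on the web and stuff that into std::stack in place of std::vector. That way you skip all the performance-hungry dynamic memory handling, but you still have a valid stack implementation, with a container underneath for the memory handling that is very likely much better than what you would come up with.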
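
As a rough sketch of that idea, the container below provides only the members that std::stack needs from its underlying container (the four typedefs plus empty, size, back, push_back and pop_back). `FixedVector` and `fixedStackDemo` are hypothetical names made up for this illustration, not one of the static_vector implementations mentioned above, and the container is only suitable for simple types such as int because it constructs all N elements up front.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>

// Hypothetical fixed-capacity container, for illustration only.
// Storage lives inside the object, so no dynamic allocation happens.
template <class T, std::size_t N>
class FixedVector {
public:
    typedef T value_type;
    typedef T& reference;
    typedef const T& const_reference;
    typedef std::size_t size_type;

    FixedVector() : data_(), size_(0) {}

    bool empty() const { return size_ == 0; }
    size_type size() const { return size_; }

    reference back() {
        assert(size_ > 0);
        return data_[size_ - 1];
    }
    const_reference back() const {
        assert(size_ > 0);
        return data_[size_ - 1];
    }

    void push_back(const T& x) {
        assert(size_ < N);  // overflow is a checked precondition in this sketch
        data_[size_++] = x;
    }
    void pop_back() {
        assert(size_ > 0);
        --size_;
    }

private:
    T data_[N];
    size_type size_;
};

// std::stack works unchanged on top of the fixed-capacity container.
int fixedStackDemo() {
    std::stack<int, FixedVector<int, 200> > s;
    for (int i = 0; i < 100; ++i) s.push(i);
    s.pop();         // removes 99
    return s.top();  // 98 is now on top
}
```

A real static_vector would also manage element lifetimes properly and offer the full std::vector interface; this sketch skips both.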

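【Comments】:

+1 for static_vector (could std::array be used, or is it too un-container-like?)
@sehe AFAIK std::array, being designed as a static array, has no push_back() etc., so it can't be used. Those static_vector things, OTOH, are drop-in replacements for std::vector, so they support the whole interface.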

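【Answer 3】:

Many comments (and even answers) focus on the risks in your implementation. Yet the question stands.

As directly demonstrated below, rectifying the perceived code shortcomings would not change anything significant regarding the performance.

Here is the OP's code modified to be (A) safe, (B) supporting the same operations as std::stack, and (C) reserving buffer space for the std::stack as well, in order to clarify things for those who mistakenly believed that this stuff mattered for the performance: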

#define _SECURE_SCL 0
#define _SCL_SECURE_NO_WARNINGS

#include <algorithm>        // std::swap
#include <iostream>
#include <vector>
#include <stack>
#include <stddef.h>         // ptrdiff_t
#include <type_traits>      // std::is_pod
using namespace std;

#undef UNICODE
#define UNICODE
#include <Windows.h>

typedef ptrdiff_t   Size;
typedef Size        Index;

template< class Type, class Container >
void reserve( Size const newBufSize, std::stack< Type, Container >& st )
{
    struct Access: std::stack< Type, Container >
    {
        static Container& container( std::stack< Type, Container >& st )
        {
            return st.*&Access::c;
        }
    };

    Access::container( st ).reserve( newBufSize );
}

class HighResolutionTimer
{
public:
    HighResolutionTimer();
    double GetFrequency() const;
    void Start() ;
    double Stop();
    double GetTime() const;

private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HighResolutionTimer::HighResolutionTimer()
{
    frequency = GetFrequency();
}

double HighResolutionTimer::GetFrequency() const
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return static_cast< double >( proc_freq.QuadPart );
}

void HighResolutionTimer::Start()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HighResolutionTimer::Stop()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
} 

double HighResolutionTimer::GetTime() const
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}

template< class Type, bool elemTypeIsPOD = !!std::is_pod< Type >::value >
class FastStack;

template< class Type >
class FastStack< Type, true >
{
private:
    Type*   st_;
    Index   lastIndex_;
    Size    capacity_;

public:
    Size const size() const { return lastIndex_ + 1; }
    Size const capacity() const { return capacity_; }

    void reserve( Size const newCapacity )
    {
        if( newCapacity > capacity_ )
        {
            FastStack< Type >( *this, newCapacity ).swapWith( *this );
        }
    }

    void push( Type const& x )
    {
        if( size() == capacity() )
        {
            reserve( 2*capacity() );
        }
        st_[++lastIndex_] = x;
    }

    void pop()
    {
        --lastIndex_;
    }

    Type top() const
    {
        return st_[lastIndex_];
    }

    void swapWith( FastStack& other ) throw()
    {
        using std::swap;
        swap( st_, other.st_ );
        swap( lastIndex_, other.lastIndex_ );
        swap( capacity_, other.capacity_ );
    }

    void operator=( FastStack other )
    {
        other.swapWith( *this );
    }

    ~FastStack()
    {
        delete[] st_;
    }

    FastStack( Size const aCapacity = 0 )
        : st_( new Type[aCapacity] )
        , capacity_( aCapacity )
    {
        lastIndex_ = -1;
    }

    FastStack( FastStack const& other, int const newBufSize = -1 )
    {
        capacity_ = (newBufSize < other.size()? other.size(): newBufSize);
        st_ = new Type[capacity_];
        lastIndex_ = other.lastIndex_;
        copy( other.st_, other.st_ + other.size(), st_ );   // Can't throw for POD.
    }
};

template< class Type >
void reserve( Size const newCapacity, FastStack< Type >& st )
{
    st.reserve( newCapacity );
}

template< class StackType >
void test( char const* const description )
{
    for( int it = 0; it < 4; ++it )
    {
        StackType st;
        reserve( 200, st );

        // after this two loops, st's capacity will be 141 so there will be no more reallocating
        for( int i = 0; i < 100; ++i ) { st.push( i ); }
        for( int i = 0; i < 100; ++i ) { st.pop(); }

        // when you uncomment this line, std::stack performance will magically rise about 18%
        // std::vector<int> magicVector(10);

        HighResolutionTimer timer;
        timer.Start();

        for( Index i = 0; i < 1000000000; ++i )
        {
            st.push( i );
            (void) st.top();
            if( i % 100 == 0 && i != 0 )
            {
                for( int j = 0; j < 100; ++j ) { st.pop(); }
            }
        }

        double const totalTime = timer.Stop();
        wcout << description << ": "  << totalTime << endl;
    }
}

int main()
{
    typedef stack< Index, vector< Index > > SStack;
    typedef FastStack< Index >              FStack;

    test< SStack >( "std::stack" );
    test< FStack >( "FastStack" );

    cout << "Done";
}

Results on this slow-as-molasses Samsung RC530 laptop:

[D:\dev\test\so\12704314]
> a
std::stack: 3.21319
std::stack: 3.16456
std::stack: 3.23298
std::stack: 3.20854
FastStack: 1.97636
FastStack: 1.97958
FastStack: 2.12977
FastStack: 2.13507
Done
[D:\dev\test\so\12704314]
> _

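And similarly for Visual C++.

Now let's look at a typical implementation of std::vector::push_back, which is called by std::stack<T, std::vector<T>>::push (in passing, I know of only 3 programmers who have ever used this indentation style, namely PJP, Petzold and myself; since 1998 or thereabouts I have thought it's horrible!):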

void push_back(const value_type& _Val)
    {   // insert element at end
    if (_Inside(_STD addressof(_Val)))
        {   // push back an element
        size_type _Idx = _STD addressof(_Val) - this->_Myfirst;
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            this->_Myfirst[_Idx]);
        ++this->_Mylast;
        }
    else
        {   // push back a non-element
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            _Val);
        ++this->_Mylast;
        }
    }

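I suspect that the measured inefficiency lies at least partly in all the stuff going on there, and perhaps it is also a matter of automatically generated safety checks.

For a debug build the std::stack performance is so extremely bad that I gave up waiting for any result.

EDIT: following Xeo's comment below, I updated push to check for "self-push" in the case of buffer reallocation, by factoring that case out as a separate function: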

void push( Type const& x )
{
    if( size() == capacity() )
    {
        reserveAndPush( x );
    }
    st_[++lastIndex_] = x;
}

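Mysteriously, although reserveAndPush is never called in this test, it affects the performance. Perhaps this is due to the code size no longer fitting the cache?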

[D:\dev\test\so\12704314]
> a
std::stack: 3.21623
std::stack: 3.30501
std::stack: 3.24337
std::stack: 3.27711
FastStack: 2.52791
FastStack: 2.44621
FastStack: 2.44759
FastStack: 2.47287
Done
[D:\dev\test\so\12704314]
> _

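EDIT 2: DeadMG showed that the code must be buggy. I believe the problem was a missing return, plus the expression that computed the new size (twice zero is still zero). He also pointed out that I forgot to show reserveAndPush. It should be: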

void reserveAndPush( Type const& x )
{
    Type const xVal = x;
    reserve( capacity_ == 0? 1 : 2*capacity_ );
    push( xVal );
}

void push( Type const& x )
{
    if( size() == capacity() )
    {
        return reserveAndPush( x );    // <-- The crucial "return".
    }
    st_[++lastIndex_] = x;
}

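【Comments】:

A large part of the slowness probably comes from checking whether the element being push_backed is itself an element of this vector. That check is required, otherwise v.push_back(v[0]) would break when the vector has to reallocate, but a stack generally doesn't have to care about it unless s.push(s.top()) is called.
In your code, since your top returns a copy, this isn't a problem either... but again that fails to implement the same functionality. Also, as I mentioned in chat, your FastStack doesn't destroy elements (not needed for PODs), which is again different functionality from std::stack. Try a custom allocator that simply assigns when asked to construct, and has a no-op destroy method.
This answer is way too long and boils down to the penultimate sentence, 'std::stack is slow because it does a lot of checks'. It also fails to address one of the key points of the OP's question: why does that one line affect the timing so much?
The usage of std::stack::top in the OP's code may make a copy, but that is not inherent to std::stack, since top returns a reference, which may be used as the argument to push. Also, you are still destroying the elements in the std::stack case.
But I did establish that your FastStack is buggy. I modified your code slightly and FastStack crashed, whereas std::stack was fine. Visual Studio reported heap corruption. Here is my modified test.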
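
For readers who want to see the self-push case from these comments in isolation, here is a small sketch. The function names `selfPushDemo` and `stackSelfPushDemo` are made up for this illustration. It relies only on the standard guarantee that push_back works when its argument refers to an element of the same vector, even if the call has to reallocate.

```cpp
#include <cassert>
#include <stack>
#include <vector>

// Pushes a reference to the vector's own first element at a moment
// when the vector is full, so this push_back has to reallocate.
int selfPushDemo() {
    std::vector<int> v(1, 7);
    while (v.size() < v.capacity()) v.push_back(0);  // fill to capacity
    assert(v.size() == v.capacity());
    v.push_back(v[0]);  // argument aliases storage that is about to move
    return v.back();
}

// The same situation through std::stack: top() returns a reference
// into the underlying vector, and that reference is passed to push().
int stackSelfPushDemo() {
    std::stack<int, std::vector<int> > s;
    s.push(42);
    for (int i = 0; i < 1000; ++i) s.push(s.top());  // some pushes reallocate
    return s.top();
}
```

A stack whose top returns a copy, like the FastStack above, never sees this aliasing, which is one reason the two are not doing the same work.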