c# to c++ dictionary to unordered_map results

【Question title】: c# to c++ dictionary to unordered_map results
【Posted】: 2011-10-21 22:17:55
【Question】:

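I've been doing c# for a few years now, and I'm trying to learn something new. So I decided to have a look at c++, to get to know programming in a different way.

I've been doing loads of reading, but I only started writing some code today.

On my Windows 7/64-bit machine, running VS2010, I created two projects: 1) A c# project that lets me write things the way I'm used to. 2) A c++ "makefile" project that lets me play around, trying to implement the same thing. From what I understand, this is NOT a .NET project.

I got as far as trying to populate a dictionary with 10K values. For some reason, the c++ is orders of magnitude slower.

Here's the c# below. Note that I put in a function call after the time measurement to make sure the compiler doesn't "optimize" the loop away: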

var freq = System.Diagnostics.Stopwatch.Frequency;

int i;
Dictionary<int, int> dict = new Dictionary<int, int>();
var clock = System.Diagnostics.Stopwatch.StartNew();

for (i = 0; i < 10000; i++)
     dict[i] = i;
clock.Stop();

Console.WriteLine(clock.ElapsedTicks / (decimal)freq * 1000M);
Console.WriteLine(dict.Average(x=>x.Value));
Console.ReadKey(); //Don't want results to vanish off screen

Here's the c++; not much thought has gone into it (trying to learn, right?):

LARGE_INTEGER frequency;        // ticks per second
LARGE_INTEGER t1, t2;           // ticks
double elapsedTime;

// get ticks per second
QueryPerformanceFrequency(&frequency);

int i;
boost::unordered_map<int, int> dict;
// start timer
QueryPerformanceCounter(&t1);

for (i=0;i<10000;i++)
    dict[i]=i;

// stop timer
QueryPerformanceCounter(&t2);

// compute and print the elapsed time in millisec
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
cout << elapsedTime << " ms insert time\n";
int input;
cin >> input; //don't want console to disappear

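Now, some caveats. I managed to find this related SO question. One of the guys wrote a long answer mentioning that WOW64 skews the results. I've set the project to Release and gone through the "Properties" tab of the c++ project, enabling everything that sounded like it would make it faster. I changed the platform to x64, though I'm not sure whether that addresses his WOW64 issue. I'm not that experienced with compiler options; perhaps you guys have more of a clue?

Oh, and the results: c#: 0.32 ms, c++: 8.26 ms. This is a bit strange. Have I misinterpreted what .Quad means? I copied the c++ timer code from somewhere on the web, after going through all the boost installation and include/libfile rigmarole. Or perhaps I am unwittingly using different instruments? Or is there some critical compile option that I haven't used? Or maybe the c# code is optimized away because the average is a constant?

Here's the c++ command line, from Property Pages -> C/C++ -> Command Line: /I"C:\Users\Carlos\Desktop\boost_1_47_0" /Zi /nologo /W3 /WX- /MP /Ox /Oi /Ot /GL /D "_MBCS" /Gm- /EHsc /GS- /Gy- /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope /Fp"x64\Release\MakeTest.pch" /Fa"x64\Release\" /Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue

Any help would be appreciated, thanks.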

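【Comments】:

Have you tried std::map instead of boost::unordered_map?
Don't put too much trust in that other answer. In particular, his remarks about WOW64 are completely off base: system calls may incur a penalty (though I don't think it matters much), but math definitely does not. x86 FPU code runs just as fast under WOW64 as it does on a 32-bit processor. About half of the other things in that answer are also off the mark.
Yes, I tried map, then I read that it's more akin to SortedDictionary. I played with various types, and it made no difference.
I can't reproduce that result at all. My simple C++0x implementation is consistently a lot faster than your C# version. I'm using GCC 4.6.1 with -O3 -march=native, and gmcs.

Tags: c# c++ visual-studio performance collections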

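【Answer 1】:

Storing a contiguous sequence of integer keys, added in ascending order, is definitely NOT what hash tables are optimized for.

Use an array, or else generate random values.

And do some retrievals. Hash tables are highly optimized for retrieval.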
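
To make that suggestion concrete, here is a minimal sketch of a benchmark that inserts random keys and then looks every one of them up again. It is not the code anyone in this thread ran: `BenchResult` and `bench_random_keys` are hypothetical names, and it assumes a C++11 standard library, using `std::chrono::steady_clock` in place of QueryPerformanceCounter so that it is portable.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <unordered_map>
#include <vector>

// Hypothetical result type for one benchmark run (not from the thread).
struct BenchResult {
    double insert_ms;   // time spent inserting all keys
    double lookup_ms;   // time spent looking all keys up again
    std::size_t hits;   // number of successful lookups
};

// Inserts n pseudo-random keys, then looks each one up.
BenchResult bench_random_keys(std::size_t n, unsigned seed) {
    // Generate the keys up front so that RNG cost is not part of the timing.
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist;  // defaults to [0, INT_MAX]
    std::vector<int> keys(n);
    for (auto& k : keys) k = dist(rng);

    using clock = std::chrono::steady_clock;
    std::unordered_map<int, int> dict;

    auto t0 = clock::now();
    for (int k : keys) dict[k] = k;           // insertion phase
    auto t1 = clock::now();

    std::size_t hits = 0;
    for (int k : keys)                        // retrieval phase
        if (dict.find(k) != dict.end()) ++hits;
    auto t2 = clock::now();

    std::chrono::duration<double, std::milli> ins = t1 - t0;
    std::chrono::duration<double, std::milli> look = t2 - t1;
    return {ins.count(), look.count(), hits};
}
```

Calling `bench_random_keys(10000, 42)` and printing `insert_ms` and `lookup_ms` gives numbers that can be set against an equivalent C# loop. A single run of 10K elements is still very noisy, so repeat the measurement several times.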

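【Comments】:

@Duck: You usually don't try to understand in detail why using the wrong data structure gives poor performance; you switch to the right one.
I disagree. Understanding how data structures work internally makes you better able to know what to use in other situations. Also, we don't know what the .NET Dictionary<> uses. Perhaps it uses a hash table too, in which case your answer still doesn't fully address the performance difference the OP describes.
@Duck: We know that System.Collections.Generic.Dictionary`2 also uses a hash table. But these structures are meant to be optimized for fast lookup. If you want fast insertion, you use a data structure optimized for that. If you want to compare hash tables, you should do the operation they're intended for (lookup).
FYI, I haven't yet done the lookup part of the exercise. I figured that if I can't get them within a similar order of magnitude on insertion, then something is already wrong.
@Carlos: Why? That's like saying there's no point in looking at the performance of Java or C# on the actual task because getting to the first line of user code is 1000 times slower than in C++.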

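【Answer 2】:

The Visual Studio TR1 unordered_map is the same as stdext::hash_map:

Another thread asks why it performs slowly; see my answer there, with links to others who found the same problem. The conclusion is to use a different hash_map implementation in C++: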

Alternative to stdext::hash_map for performance reasons

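Btw., keep in mind that in C++ there is a large difference between an optimized Release build and an unoptimized Debug build, much more so than in C#.

【Comments】:

Oops, I only read "unordered_map" and didn't read the code. Ignore me :)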

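【Answer 3】:

You can try calling dict.rehash(n) with different (large) values of n before inserting elements, and see how this affects performance. Memory allocations (which happen when the container fills up its buckets) are generally more expensive in C++ than in C#, and rehashing is heavy too. For std::vector, the analogous member function is reserve.

Different rehashing policies and load factor thresholds (have a look at the max_load_factor member function) can also greatly affect the performance of unordered_map.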
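
As a rough sketch of pre-sizing, assuming a C++11 `std::unordered_map` (where `reserve(n)` is shorthand for `rehash(ceil(n / max_load_factor()))`), the table can be sized once and then filled without any rehash. With an older Boost you would call `rehash` directly. The helper names `make_presized` and `fill_without_rehash` are made up for illustration.

```cpp
#include <cstddef>
#include <unordered_map>

// Builds an empty map that already has room for n elements.
// max_load is the load factor threshold to set before sizing the table.
std::unordered_map<int, int> make_presized(std::size_t n, float max_load = 1.0f) {
    std::unordered_map<int, int> dict;
    dict.max_load_factor(max_load);  // set the threshold first...
    dict.reserve(n);                 // ...so reserve picks a bucket count that honours it
    return dict;
}

// Inserts keys 0..n-1 and reports whether the bucket array stayed the same
// size, i.e. whether the insertions triggered no rehash.
bool fill_without_rehash(std::unordered_map<int, int>& dict, std::size_t n) {
    const std::size_t buckets_before = dict.bucket_count();
    for (std::size_t i = 0; i < n; ++i)
        dict[static_cast<int>(i)] = static_cast<int>(i);
    return dict.bucket_count() == buckets_before;
}
```

Timing the insertion loop with and without the `reserve` call, and with a few different `max_load` values, shows how much of the cost comes from rehashing rather than from the per-node allocations.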

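Next, since you're using VS2010, I suggest you use std::unordered_map from the <unordered_map> header. Don't use boost when you can use the standard library.

The actual hash function used can greatly affect performance. You can try the following: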

struct custom_hash { size_t operator()(int x) const { return x; } };

and use std::unordered_map<int, int, custom_hash>.
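
Put together, a minimal usage sketch might look like the following. The `IdentityMap` alias and the `build_identity_map` helper are hypothetical names, added only to show where the hasher plugs in as the third template argument.

```cpp
#include <cstddef>
#include <unordered_map>

// Identity hash from the answer: the key itself is used as the hash value.
struct custom_hash {
    std::size_t operator()(int x) const { return static_cast<std::size_t>(x); }
};

// The hasher is the third template argument of std::unordered_map.
using IdentityMap = std::unordered_map<int, int, custom_hash>;

// Fills a map with keys 0..n-1, as in the question's loop.
IdentityMap build_identity_map(int n) {
    IdentityMap dict;
    for (int i = 0; i < n; ++i)
        dict[i] = i;
    return dict;
}
```

An identity hash is cheap to compute, but it only spreads keys well when the keys themselves are well distributed over the bucket count, so treat it as something to measure rather than a default.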

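Finally, I agree that this is a poor use of a hash table. Use random values for insertion and you'll get a more accurate picture of what's going on. Testing the insertion speed of hash tables isn't silly at all, but hash tables aren't meant for storing contiguous integers. Use a vector for that.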
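
For comparison, the contiguous-key case from the question needs no hashing at all when the key is simply the index. A small sketch, with hypothetical helper names `build_dense_table` and `average`:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Stores value i at index i, the vector equivalent of dict[i] = i.
std::vector<int> build_dense_table(std::size_t n) {
    std::vector<int> table(n);                 // one allocation for all n slots
    std::iota(table.begin(), table.end(), 0);  // fills with 0, 1, ..., n-1
    return table;
}

// Mirrors the dict.Average(...) call from the C# snippet.
double average(const std::vector<int>& table) {
    if (table.empty()) return 0.0;
    const long long sum = std::accumulate(table.begin(), table.end(), 0LL);
    return static_cast<double>(sum) / static_cast<double>(table.size());
}
```

Here `build_dense_table(10000)[1234]` is 1234, and the whole table lives in a single allocation, which is the main reason it beats any node-based container for this access pattern.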

【Comments】:

Ah, I didn't realize std had the same type.

【Answer 4】:

A simple allocator change will cut that time down a lot.

boost::unordered_map<int, int, boost::hash<int>, std::equal_to<int>, boost::fast_pool_allocator<std::pair<const int, int>>> dict;

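That runs in 0.9 ms on my system (down from 10 ms before). This suggests to me that the vast majority of your time is actually not spent in the hash table at all, but in the allocator. The comparison is unfair for two reasons. Your GC will never collect in such a trivial program, which gives it an undue performance advantage. Native allocators, on the other hand, do significant caching of free memory, but that never comes into play in such a trivial example: you never free anything, so there is nothing to cache.

Finally, the Boost pool implementation is thread-safe, whereas you never use threads, so the GC can fall back to a single-threaded implementation, which will be much faster.

I resorted to a hand-rolled, non-freeing, non-thread-safe pool allocator and got down to 0.525 ms for C++ against 0.45 ms for C# (on my machine). Conclusion: your original results were very skewed because of the different memory allocation schemes of the two languages, and once that is resolved, the difference becomes relatively small.

A custom hasher (as described in Alexandre's answer) dropped my C++ time to 0.34 ms, which is now faster than C#.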

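#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter
#include <cstddef>
#include <iostream>
#include <new>         // std::bad_alloc, placement new
#include <string>
#include <type_traits> // std::alignment_of
#include <unordered_map>
#include <utility>
#include <vector>
#include <boost/unordered_map.hpp>

static const int MaxMemorySize = 800000;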
static int FreedMemory = 0;
static int AllocatorCalls = 0;
static int DeallocatorCalls = 0;
template <typename T>
class LocalAllocator
{
  public:
      std::vector<char>* memory;
      int* CurrentUsed;
      typedef T value_type;
      typedef value_type * pointer;
      typedef const value_type * const_pointer;
      typedef value_type & reference;
      typedef const value_type & const_reference;
      typedef std::size_t size_type;
      typedef std::ptrdiff_t difference_type;

    template <typename U> struct rebind { typedef LocalAllocator<U> other; };

    template <typename U>
    LocalAllocator(const LocalAllocator<U>& other) {
        CurrentUsed = other.CurrentUsed;
        memory = other.memory;
    }
    LocalAllocator(std::vector<char>* ptr, int* used) {
        CurrentUsed = used;
        memory = ptr;
    }
    template<typename U> LocalAllocator(LocalAllocator<U>&& other) {
        CurrentUsed = other.CurrentUsed;
        memory = other.memory;
    }
    pointer address(reference r) { return &r; }
    const_pointer address(const_reference s) { return &s; }
    size_type max_size() const { return MaxMemorySize; }
    void construct(pointer ptr, value_type&& t) { new (ptr) T(std::move(t)); }
    void construct(pointer ptr, const value_type & t) { new (ptr) T(t); }
    void destroy(pointer ptr) { static_cast<T*>(ptr)->~T(); }

    // Two allocators are interchangeable when they share the same backing buffer.
    bool operator==(const LocalAllocator& other) const { return memory == other.memory; }
    bool operator!=(const LocalAllocator& other) const { return !(*this == other); }

    pointer allocate(size_type count) {
        AllocatorCalls++;
        if (*CurrentUsed + (count * sizeof(T)) > MaxMemorySize)
            throw std::bad_alloc();
        if (*CurrentUsed % std::alignment_of<T>::value) {
            *CurrentUsed += (std::alignment_of<T>::value - *CurrentUsed % std::alignment_of<T>::value);
        }
        auto val = &((*memory)[*CurrentUsed]);
        *CurrentUsed += (count * sizeof(T));
        return reinterpret_cast<pointer>(val);
    }
    void deallocate(pointer ptr, size_type n) {
        DeallocatorCalls++;
        FreedMemory += (n * sizeof(T));
    }

    // Convenience overload: allocate room for a single object.
    pointer allocate() {
        return allocate(1);
    }
    void deallocate(pointer ptr) {
        return deallocate(ptr, 1);
    }
};
int main() {
    LARGE_INTEGER frequency;        // ticks per second
    LARGE_INTEGER t1, t2;           // ticks
    double elapsedTime;

    // get ticks per second
    QueryPerformanceFrequency(&frequency);
    std::vector<char> memory;
    int CurrentUsed = 0;
    memory.resize(MaxMemorySize);

    struct custom_hash {
        size_t operator()(int x) const { return x; }
    };
    boost::unordered_map<int, int, custom_hash, std::equal_to<int>, LocalAllocator<std::pair<const int, int>>> dict(
        std::unordered_map<int, int>().bucket_count(),
        custom_hash(),
        std::equal_to<int>(),
        LocalAllocator<std::pair<const int, int>>(&memory, &CurrentUsed)
    );

    // start timer
    std::string str;
    QueryPerformanceCounter(&t1);

    for (int i=0;i<10000;i++)
        dict[i]=i;

    // stop timer
    QueryPerformanceCounter(&t2);

    // compute and print the elapsed time in millisec
    elapsedTime = ((t2.QuadPart - t1.QuadPart) * 1000.0) / frequency.QuadPart;
    std::cout << elapsedTime << " ms insert time\n";
    int input;
    std::cin >> input; //don't want console to disappear
}
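
For readers on a C++11 standard library, the same non-freeing, non-thread-safe idea can be sketched much more compactly, because `std::allocator_traits` supplies most of the boilerplate members. This is a condensed illustration rather than the allocator timed above: it uses `std::unordered_map` instead of boost, `Arena`, `ArenaAllocator`, `ArenaMap` and `fill_arena_map` are hypothetical names, and it assumes the vector's buffer is itself aligned suitably for the node type.

```cpp
#include <cstddef>
#include <functional>
#include <new>
#include <unordered_map>
#include <utility>
#include <vector>

// Fixed-size arena: hands out memory by bumping an offset and never frees it.
struct Arena {
    explicit Arena(std::size_t bytes) : buffer(bytes), used(0) {}

    void* take(std::size_t bytes, std::size_t align) {
        // Round the offset up to the requested alignment (assumes an aligned buffer start).
        const std::size_t start = (used + align - 1) / align * align;
        if (start + bytes > buffer.size()) throw std::bad_alloc();
        used = start + bytes;
        return buffer.data() + start;
    }

    std::vector<char> buffer;
    std::size_t used;
};

// Minimal C++11 allocator; everything else comes from std::allocator_traits.
template <typename T>
struct ArenaAllocator {
    typedef T value_type;

    explicit ArenaAllocator(Arena* a) : arena(a) {}
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->take(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) {}  // intentionally a no-op: memory is never reused

    Arena* arena;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return a.arena == b.arena; }
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return !(a == b); }

typedef std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                           ArenaAllocator<std::pair<const int, int> > > ArenaMap;

// Fills a map whose nodes and bucket arrays all come out of the arena.
std::size_t fill_arena_map(Arena& arena, int n) {
    ArenaMap dict(16, std::hash<int>(), std::equal_to<int>(),
                  ArenaAllocator<std::pair<const int, int> >(&arena));
    for (int i = 0; i < n; ++i)
        dict[i] = i;
    return dict.size();
}
```

Because nothing is ever returned to the arena, every rehash leaves the old bucket array behind as dead space, so the arena has to be sized generously (for example `Arena arena(1 << 20);` for a few thousand elements). That trade-off is what makes each allocation as cheap as a pointer bump.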

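【Comments】:

Indeed. Most likely each insertion creates a new node in a bucket. Small-object allocation in modern garbage-collected languages is often just a pointer increment, whereas the default allocator in C++ has to find its way through a complex data structure.
Just came back to look at this. It works now, and works well. (Only one issue: you capitalize "memory" in some places and use lowercase in others.)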