Performance of hash table, why is C++ the slowest? Answers

【Question title】: Performance of hash table, why is C++ the slowest?
【Posted】: 2016-03-01 06:04:49
【Question description】:

I ran a simple performance test on hash lookups, and the C++ version appears to be slower than both the Perl version and the Go version.

The Perl version took about 200 ms,
the C++ version took 280 ms,
and the Go version took 56 ms.

This was measured on my PC with a Core(TM) i7-2670QM CPU @ 2.20GHz, running Ubuntu 14.04.3 LTS.

Any ideas?

Perl version

use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                      clock_gettime clock_getres clock_nanosleep clock
                      stat );
sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + (0.0+ $microseconds)/1000000.0;
}
my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrid";
$x = "";
$start = getTS();
for ($i=0; $i<1000000; $i++) {
    $x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;

C++ version

#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}
using namespace std;
int main () {
  std::unordered_map<std::string,std::string> mymap;

  // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrid";

  double start = getTS();
  string x;
  for (int i=0; i<1000000; i++) {
      mymap["China"];
  }
  printf("took %f sec\n", getTS() - start);
  return 0;
}

Go version

package main

import "fmt"
import "time"

func main() {
    var x string
    mymap := make(map[string]string)
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrid";
    t0 := time.Now()
    sum := 1
    for sum < 1000000 {
        x = mymap["China"]
        sum += 1
    }
    t1 := time.Now()
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
    fmt.Println(x)
}

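Update 1

To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but the performance difference is very small.

Update 2

I got the original results when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With -O2 optimization it takes only about half the time (150 ms).

Update 3

To remove the possible char* to string constructor call, I created a string constant. The time dropped to about 220 ms (compiled without optimization). Thanks to @neil-kirk for the suggestion; with optimization (-O2 flag), the time is about 80 ms.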

  double start = getTS();
  string x = "China";
  for (int i=0; i<1000000; i++) {
      mymap[x];
  }

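Update 4

Thanks to @steffen-ullrich for pointing out that the Perl version had a syntax error. I changed it. The performance number is now about 150 ms.

Update 5

It seems that the number of executed instructions matters. The figures below come from running valgrind --tool=cachegrind <cmd>.

For the Go version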

$ valgrind --tool=cachegrind  ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103== 
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103== 
==2103== I   refs:      255,763,381
==2103== I1  misses:          3,709
==2103== LLi misses:          2,743
==2103== I1  miss rate:        0.00%
==2103== LLi miss rate:        0.00%
==2103== 
==2103== D   refs:      109,437,132  (77,838,331 rd   + 31,598,801 wr)
==2103== D1  misses:        352,474  (   254,714 rd   +     97,760 wr)
==2103== LLd misses:        149,260  (    96,250 rd   +     53,010 wr)
==2103== D1  miss rate:         0.3% (       0.3%     +        0.3%  )
==2103== LLd miss rate:         0.1% (       0.1%     +        0.1%  )
==2103== 
==2103== LL refs:           356,183  (   258,423 rd   +     97,760 wr)
==2103== LL misses:         152,003  (    98,993 rd   +     53,010 wr)
==2103== LL miss rate:          0.0% (       0.0%     +        0.1%  )

For the optimized C++ version (compiled without an optimization flag)

$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180== 
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180== 
==2180== I   refs:      5,281,474,482
==2180== I1  misses:            1,710
==2180== LLi misses:            1,651
==2180== I1  miss rate:          0.00%
==2180== LLi miss rate:          0.00%
==2180== 
==2180== D   refs:      3,170,495,683  (1,840,363,429 rd   + 1,330,132,254 wr)
==2180== D1  misses:           12,055  (       10,374 rd   +         1,681 wr)
==2180== LLd misses:            7,383  (        6,132 rd   +         1,251 wr)
==2180== D1  miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== 
==2180== LL refs:              13,765  (       12,084 rd   +         1,681 wr)
==2180== LL misses:             9,034  (        7,783 rd   +         1,251 wr)
==2180== LL miss rate:            0.0% (          0.0%     +           0.0%  )

C++ optimized version

$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157== 
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157== 
==2157== I   refs:      1,451,459,660
==2157== I1  misses:            1,599
==2157== LLi misses:            1,549
==2157== I1  miss rate:          0.00%
==2157== LLi miss rate:          0.00%
==2157== 
==2157== D   refs:        430,486,197  (340,358,108 rd   + 90,128,089 wr)
==2157== D1  misses:           12,008  (     10,337 rd   +      1,671 wr)
==2157== LLd misses:            7,372  (      6,120 rd   +      1,252 wr)
==2157== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== 
==2157== LL refs:              13,607  (     11,936 rd   +      1,671 wr)
==2157== LL misses:             8,921  (      7,669 rd   +      1,252 wr)
==2157== LL miss rate:            0.0% (        0.0%     +        0.0%  )

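【Question discussion】:

Is it possible that the C++ implementation is constructing a new std::string on every lookup?
Yes, cache the key in a local string variable outside the for loop.
Did you turn on optimization?
I don't really care about these benchmarks, because at least for now the Perl code does not do what it is supposed to do, even after it got "fixed". I don't know about the other code, but it might be wrong too, or the compiler might optimize things away. This is far from any reliable benchmark.
Apart from that: benchmarks that run for less than a second are useless, because it is pure luck which power mode the processor is currently in and which other processes happen to run in parallel and take CPU time. Real benchmarks run for hours to make sure the result is not affected too much by such issues.

Tags: c++ perl go hashtable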

【Solution 1】:

From your Perl code (before your attempts to fix it):

@mymap = ();
$mymap["U.S."] = "Washington";
$mymap["U.K."] = "London";

This is not a map but an array. The syntax for hash maps is:

  %mymap;
  $mymap{"U.S."} = ....

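So what you are effectively doing is creating an array instead of a hash map and accessing element 0 every time. Please always use use strict; and use warnings; in Perl. Even a simple syntax check with warnings enabled would have shown you the problem: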

perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.

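Apart from that, the main part of your benchmark effectively does nothing useful (it assigns a variable and never uses it), and some compilers can detect this and simply optimize it away.

If you check the code that Perl generates for your program, you will see: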

$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
    $x = $mymap[0];
}

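That is, it detects the problem at compile time and replaces it with an access to array index 0.

Thus, whenever you do benchmarks, you need to:

Check that your program actually does what it is supposed to do.
Check that the compiler does not optimize things away and does not execute at compile time what other languages execute at run time. Any statement with no result, or with a predictable result, is prone to this kind of optimization.
Verify that you actually measure what you intend to measure and not something else. Even small changes to the program can affect the run time, for example because memory has to be allocated where none was needed before, and such changes may have nothing to do with what you intend to measure. In your benchmark you measure access to the same hash element again and again, with no access to other elements in between. This is an activity that plays very well with processor caches but probably does not reflect real-world use.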
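For the second point, one common approach in C++ is to let every lookup feed a value that the caller actually uses, so the loop cannot be discarded as dead code. The following is a minimal sketch under that assumption, not the original poster's code; the function name sum_lookup_lengths is hypothetical. An aggressive optimizer may still hoist loop-invariant work, so this lowers the risk rather than removing it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Performs `iterations` lookups of `key` and returns the summed length of the
// values found. Because the caller receives (and can print) the sum, the
// lookups have an observable effect.
std::size_t sum_lookup_lengths(
    const std::unordered_map<std::string, std::string>& table,
    const std::string& key, std::size_t iterations) {
    assert(iterations > 0 && "a benchmark needs at least one iteration");
    std::size_t total = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        auto it = table.find(key);  // find() never inserts, unlike operator[]
        if (it != table.end()) {
            total += it->second.size();
        }
    }
    return total;
}
```

Printing the returned total after the timed section is what makes the result "used" from the compiler's point of view.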

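Also, using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is cache thrashing... With today's CPUs the result also depends heavily on the load on the system, because the CPU might run one benchmark in a lower power mode than another, i.e. with a different CPU clock. For example, multiple runs of the same "benchmark" vary in measured time between 100 ms and 150 ms on my system.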
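Because single timings vary this much, it can help to repeat the measurement and look at the spread instead of one number. The sketch below is one possible way to do that, assuming std::chrono::steady_clock is an acceptable clock; the helper names time_lookups_us and median_of_sorted are hypothetical and not part of the original answer.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Times `lookups` count() calls per sample, repeats that `samples` times, and
// returns the per-sample wall times in microseconds, sorted ascending.
std::vector<double> time_lookups_us(
    const std::unordered_map<std::string, std::string>& table,
    const std::string& key, std::size_t lookups, std::size_t samples) {
    assert(samples > 0);
    std::vector<double> times;
    times.reserve(samples);
    volatile std::size_t sink = 0;  // keeps the lookup results observable
    for (std::size_t s = 0; s < samples; ++s) {
        const auto start = std::chrono::steady_clock::now();
        std::size_t hits = 0;
        for (std::size_t i = 0; i < lookups; ++i) {
            hits += table.count(key);
        }
        const auto stop = std::chrono::steady_clock::now();
        sink = sink + hits;
        times.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    std::sort(times.begin(), times.end());
    return times;
}

// Median of an already sorted, non-empty vector of samples.
double median_of_sorted(const std::vector<double>& sorted) {
    assert(!sorted.empty());
    const std::size_t mid = sorted.size() / 2;
    return sorted.size() % 2 == 1 ? sorted[mid]
                                  : (sorted[mid - 1] + sorted[mid]) / 2.0;
}
```

Reporting the minimum (front of the vector) together with the median gives a rough sense of how much of a single measurement is noise from the scheduler or CPU frequency changes.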

Benchmarks are lies, and micro-benchmarks like yours doubly so.

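【Discussion】:

Thanks @steffen-ullrich for pointing out the syntax. It is fixed now, but the performance number did not improve.
@codingFun: don't expect the number to improve, because after the fix the code actually has to do a more complex operation. Like I said: benchmarks are lies, and micro-benchmarks like yours doubly so. Read the part about CPUs, power modes and schedulers, and verify what the other programs do based on the generated code, not on your input. As I said, a smart compiler might realize that the code does nothing useful and optimize it away.
@codingFun: apart from that, your code is still wrong, and you would see this if you really used warnings as I suggested.
You are right. Micro-benchmarks can only give a very rough idea. More research is needed to decide which language to choose for a project. By the way, I made a typo when editing $x = $mymap["China"];. It has been updated now. Performance is about 180 ms.
@codingFun: as you can see, performance changes a lot with small changes. Just the simple change from $x = $mymap{..} to my $x = $mymap{...} affects the performance significantly. On my system there is a 50% variation between multiple runs of the program.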

【Solution 2】:

I modified your example a bit to get some details about the structure of the hash table:

#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>

using namespace std;
int main()
{
    std::unordered_map<std::string, std::string> mymap;

    // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrid";

    std::hash<std::string> h;
    for ( auto const& i : mymap )
    {

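        // %zu is the correct conversion for size_t. The original used "%ud",
        // which is why each value in the output listed below ends with a 'd'.
        printf( "hash(%s) = %zu\n", i.first.c_str(), h( i.first ) );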
    }

    for ( std::size_t i = 0; i != mymap.bucket_count(); ++i )
    {
        auto const bucketsize = mymap.bucket_size( i );
        if ( 0 != bucketsize )
        {
            printf( "size of bucket %zu = %zu\n", i, bucketsize );
        }
    }

    auto const start = std::chrono::steady_clock::now();

    string const x = "China";
    std::string res;

    for ( int i = 0; i < 1000000; i++ )
    {
        mymap.find( x );
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf( "%s\n", res.c_str() );  // passing a std::string to %s is undefined
    printf( "took %lld ms\n",
            static_cast<long long>(
                std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() ) );
    return 0;
}

Running this on my system, I get a runtime of about 68 ms with the following output:

hash(Japan) = 3611029618d
hash(Spain) = 749986602d
hash(China) = 3154384700d
hash(U.S.) = 2546447179d
hash(Italy) = 2246786301d
hash(Germany) = 2319993784d
hash(U.K.) = 2699630607d
hash(France) = 3266727934d
hash(Russia) = 3992029278d
size of bucket 0 = 0
size of bucket 1 = 0
size of bucket 2 = 1
size of bucket 3 = 1
size of bucket 4 = 1
size of bucket 5 = 0
size of bucket 6 = 1
size of bucket 7 = 0
size of bucket 8 = 0
size of bucket 9 = 2
size of bucket 10 = 3

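It turns out that the hash table is not well optimized and contains some collisions. Printing the elements in each bucket shows that Spain and China both end up in bucket 9. A bucket is probably a linked list whose nodes are scattered across memory, which explains the performance drop.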
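The per-bucket listing mentioned above is not shown in the answer. A minimal sketch of how it could be produced with the standard bucket interface (bucket_count, bucket_size, and the local iterators begin(n)/end(n)) follows; the function name shared_buckets is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Returns one group of keys for every bucket that holds two or more elements.
// Works for any std::unordered_map instantiation.
template <typename Map>
std::vector<std::vector<typename Map::key_type>> shared_buckets(const Map& table) {
    std::vector<std::vector<typename Map::key_type>> groups;
    for (typename Map::size_type b = 0; b < table.bucket_count(); ++b) {
        if (table.bucket_size(b) < 2) {
            continue;  // empty or collision-free bucket
        }
        std::vector<typename Map::key_type> keys;
        for (auto it = table.begin(b); it != table.end(b); ++it) {
            keys.push_back(it->first);
        }
        assert(keys.size() == table.bucket_size(b));
        groups.push_back(keys);
    }
    return groups;
}
```

Which keys share a bucket depends on the standard library's hash function and the current bucket count, so the grouping on another system may differ from the Spain/China pair reported here.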

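If you choose a different hash table size that produces no collisions, you can get a speedup. I tested this by adding mymap.rehash(1001) and got a speedup of 20-30%, with run times between 44 and 52 ms.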
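To check whether a given bucket count really removes the collisions, the fullest bucket can be measured after the rehash. This is a small sketch of that check; the helper names max_bucket_size and rehash_and_measure are hypothetical, and the value 1001 from the answer is just one possible argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

using StringMap = std::unordered_map<std::string, std::string>;

// Size of the fullest bucket; a result of 1 means no two keys share a bucket.
std::size_t max_bucket_size(const StringMap& table) {
    std::size_t fullest = 0;
    for (std::size_t b = 0; b < table.bucket_count(); ++b) {
        fullest = std::max(fullest, table.bucket_size(b));
    }
    return fullest;
}

// Requests at least `buckets` buckets, then reports the fullest bucket.
std::size_t rehash_and_measure(StringMap& table, std::size_t buckets) {
    table.rehash(buckets);
    assert(table.bucket_count() >= buckets);  // postcondition of rehash()
    return max_bucket_size(table);
}
```

Calling rehash_and_measure(mymap, 1001) before the timed loop and checking that it returns 1 would confirm that the lookup no longer has to walk a shared bucket.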

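Now, another point is the computation of the hash value for "China". This function is executed in every iteration. We can make it go away by switching to constant plain C strings: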

#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>

static auto constexpr us = "U.S.";
static auto constexpr uk = "U.K.";
static auto constexpr fr = "France";
static auto constexpr ru = "Russia";
static auto constexpr cn = "China";
static auto constexpr ge = "Germany";
static auto constexpr jp = "Japan";
static auto constexpr it = "Italy";
static auto constexpr sp = "Spain";

using namespace std;
int main()
{
    std::unordered_map<const char*, std::string> mymap;

    // populating container:
    mymap[us] = "Washington";
    mymap[uk] = "London";
    mymap[fr] = "Paris";
    mymap[ru] = "Moscow";
    mymap[cn] = "Beijing";
    mymap[ge] = "Berlin";
    mymap[jp] = "Tokyo";
    mymap[it] = "Rome";
    mymap[sp] = "Madrid";

    string const x = "China";
    char const* res = nullptr;
    auto const start = std::chrono::steady_clock::now();
    for ( int i = 0; i < 1000000; i++ )
    {
        res = mymap[cn].c_str();
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf( "%s\n", res );
    printf( "took %lld ms\n",
            static_cast<long long>(
                std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() ) );
    return 0;
}

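On my machine, this reduces the runtime by 50% to ~20 ms. The difference is that the hash value is no longer computed from the string contents. Instead, the address is converted into a hash value, which is much faster because the hash function simply returns the address as a size_t. We also don't need to rehash, because the bucket for cn has no collisions.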
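A consequence of hashing the address is that lookups compare pointer values, not characters. The sketch below illustrates this under the assumption that the default std::hash and std::equal_to for const char* are used, as in the code above; the helper names contains_pointer and demonstrates_identity_lookup are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using PointerKeyedMap = std::unordered_map<const char*, std::string>;

// Looks up `key` by its pointer value and reports whether an entry exists.
bool contains_pointer(const PointerKeyedMap& table, const char* key) {
    assert(key != nullptr);
    return table.find(key) != table.end();
}

// Builds a one-entry map keyed by `stored`, then probes it twice: once with
// the same pointer and once with a separate buffer holding identical text.
// Returns true only if the first probe hits and the second one misses.
bool demonstrates_identity_lookup(const char* stored) {
    PointerKeyedMap table;
    table[stored] = "Beijing";
    const std::string copy(stored);  // same characters, different address
    return contains_pointer(table, stored) &&
           !contains_pointer(table, copy.c_str());
}
```

So the speedup comes with a restriction: every lookup has to use the very same pointer that was used for insertion.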

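【Discussion】:

Switching to char* means you are using reference equality, so the code will not work if you pass in strings that come from different sources.
@CodesInChaos You are right, but I think that is not the point of this micro-benchmark.
Thanks @Jens for the detailed explanation. Collisions are indeed a big problem. Using a different key, "France", improves performance by about 25%. Comparing pointers does improve performance a lot, but that is not what we need here. I believe the performance has a lot to do with the number of instructions executed, as james-picone pointed out regarding the generated assembly. I added Update 5 to the question. Thanks.
@codingFun The example using pointers shows that a large part of the run time is spent computing the hash value, since that is what changes with the key type. I guess the Go version either has a better hash function for strings, maybe caches the hash value and therefore computes it only once, or uses some other strategy to implement maps for small strings.
@jens, same here. I just hope the C++ implementation can catch up with Go.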
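The guess in the discussion that a hash value could be cached and computed only once can be tried in C++ while still comparing keys by content. The following is a rough sketch of that idea, not something any of the posters wrote and not a description of what Go does; PrehashedKey, PrehashedKeyHash, PrehashedMap and count_hits are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// A key that hashes its text once, at construction, and remembers the result.
struct PrehashedKey {
    std::string text;
    std::size_t hash;
    explicit PrehashedKey(std::string t)
        : text(std::move(t)), hash(std::hash<std::string>{}(text)) {}
    bool operator==(const PrehashedKey& other) const {
        return hash == other.hash && text == other.text;  // content equality
    }
};

// Hasher that returns the stored value instead of rehashing the text.
struct PrehashedKeyHash {
    std::size_t operator()(const PrehashedKey& key) const { return key.hash; }
};

using PrehashedMap =
    std::unordered_map<PrehashedKey, std::string, PrehashedKeyHash>;

// Looks up `key` `iterations` times; the loop itself never hashes the text.
std::size_t count_hits(const PrehashedMap& table, const PrehashedKey& key,
                       std::size_t iterations) {
    // Debug-only sanity check that the cached hash still matches the text.
    assert(key.hash == std::hash<std::string>{}(key.text));
    std::size_t hits = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        hits += table.count(key);
    }
    return hits;
}
```

Unlike the char* variant, a key built from a different buffer with the same characters still matches, at the cost of one hash computation per key object.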

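【Solution 3】:

This just shows that, for this particular use case, Go's hash map implementation is very well optimized.

mymap["China"] calls mapaccess1_faststr, which is specifically optimized for string keys. In particular, for small one-bucket maps, the hash code is not even computed for short strings (less than 32 bytes).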
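The idea of skipping the hash for a tiny map can be illustrated in C++ with a plain linear scan over a handful of entries. This is only a sketch of the general technique, not a port of Go's mapaccess1_faststr; the class name TinyStringMap and the capacity of 8 are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

constexpr std::size_t kTinyCapacity = 8;  // arbitrary small capacity

// A fixed-capacity string map that is searched by linear scan, so no hash
// value is ever computed.
class TinyStringMap {
public:
    // Inserts or overwrites an entry; returns false if the map is full.
    bool set(const std::string& key, const std::string& value) {
        for (std::size_t i = 0; i < size_; ++i) {
            if (entries_[i].first == key) {
                entries_[i].second = value;
                return true;
            }
        }
        if (size_ == kTinyCapacity) {
            return false;
        }
        entries_[size_].first = key;
        entries_[size_].second = value;
        ++size_;
        assert(size_ <= kTinyCapacity);
        return true;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const std::string* get(const std::string& key) const {
        for (std::size_t i = 0; i < size_; ++i) {
            const std::string& candidate = entries_[i].first;
            // Cheap length check first, full comparison only on a match.
            if (candidate.size() == key.size() && candidate == key) {
                return &entries_[i].second;
            }
        }
        return nullptr;
    }

private:
    std::array<std::pair<std::string, std::string>, kTinyCapacity> entries_;
    std::size_t size_ = 0;
};
```

For a few short keys, a scan like this touches very little memory, which is one plausible reason a specialized small-map path can beat a general hash lookup.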

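【Discussion】:

【Solution 4】:

Here's a guess:

unordered_map::operator[] requires a string argument. You are providing a char*. Without optimizations, the C++ version probably calls the std::string(char*) constructor a million times in order to turn "China" into a std::string. Go's language specification probably makes string literals the same type as string, so no constructor call is needed.

With optimizations enabled, the string constructor will get hoisted out of the loop and you won't see the same problem. Or quite possibly no code will be generated at all except the two system calls to get the time and the system call to print the difference, because ultimately none of it has any effect.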
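The per-iteration conversion can be made visible without reading assembly by using a key type that counts how often it is built from a C string. std::string itself cannot be instrumented this way, so the sketch below uses a hypothetical stand-in called CountingKey; the other names (CountingKeyHash, CountingMap, and the two functions) are likewise assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Number of conversions from const char* into the key type.
static std::size_t g_conversions = 0;

// Stand-in for std::string that records each construction from const char*.
struct CountingKey {
    std::string text;
    CountingKey(const char* s) : text(s) { ++g_conversions; }  // implicit on purpose
    bool operator==(const CountingKey& other) const { return text == other.text; }
};

struct CountingKeyHash {
    std::size_t operator()(const CountingKey& key) const {
        return std::hash<std::string>{}(key.text);
    }
};

using CountingMap = std::unordered_map<CountingKey, std::string, CountingKeyHash>;

// Passes a literal inside the loop: one temporary key per iteration.
std::size_t conversions_with_literal(CountingMap& table, std::size_t iterations) {
    assert(iterations > 0);
    g_conversions = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        table["China"];
    }
    return g_conversions;
}

// Builds the key once outside the loop: a single conversion in total.
std::size_t conversions_with_hoisted_key(CountingMap& table, std::size_t iterations) {
    assert(iterations > 0);
    g_conversions = 0;
    const CountingKey key("China");
    for (std::size_t i = 0; i < iterations; ++i) {
        table[key];
    }
    return g_conversions;
}
```

Because the counter is an observable side effect, the compiler cannot hoist the conversion here, so the count reflects what the source code asks for rather than what an optimizer would emit for std::string.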

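To confirm this, you would have to actually look at the assembly being generated. That is a compiler option. See this question for the flags GCC requires.

【Discussion】:

Removing the extra constructor call helps, but it seems the C++ version is still quite slow. Thanks!
I am still curious about what the difference is between the Go assembly and the C++ assembly.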