Using 1GB pages degrade performance

【Question title】: Using 1GB pages degrade performance
【Posted】: 2020-12-21 14:32:32
【Question】:

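I have an application that needs about 850 MB of contiguous memory and accesses it in a random manner. It was suggested that I allocate one 1 GB huge page so that it would always be in the TLB. I wrote a demo with sequential/random accesses to measure the performance of small (4 KB in my case) versus large (1 GB) pages: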

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // Aren't used in this example.
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#define MESSINESS_LEVEL 512 // Poisons caches if LRU policy is used.

#define RUN_TESTS 25

void print_usage() {
  printf("Usage: ./program small|huge1gb sequential|random\n");
}

int main(int argc, char *argv[]) {
  if (argc != 3 && argc != 4) {
    print_usage();
    return -1;
  }
  uint64_t size = 1UL * 1024 * 1024 * 1024; // 1GB
  uint32_t *ptr;
  if (strcmp(argv[1], "small") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, // basically malloc(size);
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap small");
      exit(1);
    }
  } else if (strcmp(argv[1], "huge1gb") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap huge1gb");
      exit(1);
    }
  } else {
    print_usage();
    return -1;
  }

  clock_t start_time, end_time;
  start_time = clock();

  if (strcmp(argv[2], "sequential") == 0) {
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
        ptr[i] = i * 5;
    }
  } else if (strcmp(argv[2], "random") == 0) {
    // pseudorandom access pattern, defeats caches.
    uint64_t index;
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
        for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
          index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
          ptr[index] = index * 5;
        }
      }
    }
  } else {
    print_usage();
    return -1;
  }

  end_time = clock();
  long double duration = (long double)(end_time - start_time) / CLOCKS_PER_SEC;
  printf("Avr. Duration per test: %Lf\n", duration / RUN_TESTS);
  //  write(1, ptr, size); // Dumps memory content (1GB to stdout).
}

On my machine (more on it below) the results are:

Sequential:

$ ./test small sequential
Avr. Duration per test: 0.562386
$ ./test huge1gb sequential        <--- slightly better
Avr. Duration per test: 0.543532

Random:

$ ./test small random              <--- better
Avr. Duration per test: 2.911480
$ ./test huge1gb random
Avr. Duration per test: 6.461034

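The random test bothers me: the 1GB page appears to be 2 times slower! I tried using madvise with MADV_SEQUENTIAL / MADV_RANDOM for the respective tests, but it did not help.

Why does using one huge page degrade performance in the case of random accesses? What are the general use cases for huge pages (2MB and 1GB)?

I did not test this code with 2MB pages; I think it should probably do better. I also suspect that, since a 1GB page is stored in one memory bank, it may have something to do with multi-channel memory. But I would like to hear from you folks. Thanks.

Note: to run the test you must first enable 1GB pages in your kernel. You can do that by passing the kernel these parameters: hugepagesz=1G hugepages=1 default_hugepagesz=1G. More: https://wiki.archlinux.org/index.php/Kernel_parameters. If enabled, you should get something like: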

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         1048576 kB
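
The same check can be done from the program before it calls mmap. The following is a minimal sketch, not part of the original post: the helper names parse_meminfo_line and free_huge_pages are hypothetical, and it assumes the /proc/meminfo layout shown above, where HugePages_Free counts pages of the default huge page size.

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/meminfo-style line such as "HugePages_Free:        1".
 * Returns 1 and stores the number in *value if the line starts with `key`
 * immediately followed by ':'; otherwise returns 0. */
int parse_meminfo_line(const char *line, const char *key, long *value) {
  size_t key_len = strlen(key);
  if (strncmp(line, key, key_len) != 0 || line[key_len] != ':')
    return 0;
  return sscanf(line + key_len + 1, "%ld", value) == 1;
}

/* Returns the HugePages_Free count from /proc/meminfo,
 * or -1 if the file or the field is not available. */
long free_huge_pages(void) {
  FILE *f = fopen("/proc/meminfo", "r");
  if (!f)
    return -1;
  char line[256];
  long value = -1;
  while (fgets(line, sizeof line, f)) {
    if (parse_meminfo_line(line, "HugePages_Free", &value))
      break;
  }
  fclose(f);
  return value;
}
```

A caller could print a clearer message when free_huge_pages() returns 0 or -1 instead of relying only on the perror output of a failed mmap.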

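EDIT1: My machine has a Core i5 8600 and 4 memory banks of 4 GB each. The CPU natively supports both 2MB and 1GB pages (it has the pse and pdpe1gb flags, see: https://wiki.debian.org/Hugepages#x86_64). I am measuring machine time, not CPU time. I updated the code, and the results are now averages of 25 tests.

I was also told that this test does better on 2MB pages than on normal 4KB pages.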

【Comments on the question】:

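You are out of context. Contiguous virtual address space is not contiguous in physical address space. If you think allocating a single block of memory will reduce page faults and thus improve performance, note that in real systems the results are usually counterintuitive.
@TonyTannous Huge pages, if supported, are contiguous in physical memory.
Shouldn't you also use both MAP_POPULATE and MAP_LOCKED, unless you specifically want to test faulting performance? In any case, you should be able to use perf to look at TLB, cache and other hardware counters.
@TonyTannous As far as I know, one virtual page, if we are talking about memory mapping as in my case (it could also be a file mapping, a device, etc.), corresponds to one physical page of exactly that size, or to a contiguous chunk of memory of that size. The x86_64 ISA supports 2MB and 1GB pages: wiki.debian.org/Hugepages#x86_64.
I confirm your observation: random access with a 1GB page is two times slower than with 4kB pages on Skylake. Quite peculiar.

Tags: c linux virtual-memory tlb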

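【Answer 1】:

Not an answer, but more details on this perplexing issue.

Performance counters show a roughly similar number of instructions, but roughly twice the number of cycles spent when huge pages are used:

4KiB pages: IPC 0.29,
1GiB pages: IPC 0.10.

These IPC numbers show that the code is bottlenecked on memory access (CPU-bound IPC on Skylake is 3 and above). With huge pages the bottleneck is even harder.

I modified your benchmark to use MAP_POPULATE | MAP_LOCKED | MAP_FIXED with the fixed address 0x600000000000 in both cases, to eliminate time variation associated with page faults and random mapping addresses. On my Skylake system, both 2MiB and 1GiB pages are more than 2x slower than 4kiB pages.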
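
The modified benchmark itself is not shown in the answer. A minimal sketch of what such a mapping call might look like is given below; the helper name map_prefaulted is hypothetical, and the sketch deliberately leaves out MAP_FIXED and the fixed address, since forcing an address can clobber existing mappings. MAP_LOCKED is optional here because it can fail when the RLIMIT_MEMLOCK limit is low.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS, MAP_POPULATE, MAP_HUGETLB */
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Map an anonymous read/write buffer whose pages are faulted in by mmap
 * itself (MAP_POPULATE), so that page faults stay out of the timed region.
 *   use_1gb: nonzero requests 1 GiB huge pages (size should be a multiple
 *            of 1 GiB and a free huge page must be reserved).
 *   lock:    nonzero adds MAP_LOCKED to keep the pages resident.
 * Returns NULL on failure; errno is left as set by mmap. */
void *map_prefaulted(size_t size, int use_1gb, int lock) {
  int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE;
  if (use_1gb)
    flags |= MAP_HUGETLB | MAP_HUGE_1GB;
  if (lock)
    flags |= MAP_LOCKED;
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}
```

In the question's program, the two mmap calls could be replaced by map_prefaulted(size, 0, 1) and map_prefaulted(size, 1, 1) respectively.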

Compiled with g++-8.4.0 -std=gnu++14 -pthread -m{arch,tune}=skylake -O3 -DNDEBUG:

[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 2MB:64 --pool-pages-max 2MB:64
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
[max@supernova:~/src/test] $ for s in small huge; do sudo chrt -f 40 taskset -c 7 perf stat -dd ./release/gcc/test $s random; done
Duration: 2156150

 Performance counter stats for './release/gcc/test small random':

       2291.190394      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                53      page-faults               #    0.023 K/sec                  
    11,448,252,551      cycles                    #    4.997 GHz                      (30.83%)
     3,268,573,978      instructions              #    0.29  insn per cycle           (38.55%)
       430,248,155      branches                  #  187.784 M/sec                    (38.55%)
           758,917      branch-misses             #    0.18% of all branches          (38.55%)
       224,593,751      L1-dcache-loads           #   98.025 M/sec                    (38.55%)
       561,979,341      L1-dcache-load-misses     #  250.22% of all L1-dcache hits    (38.44%)
       271,067,656      LLC-loads                 #  118.309 M/sec                    (30.73%)
           668,118      LLC-load-misses           #    0.25% of all LL-cache hits     (30.73%)
   <not supported>      L1-icache-loads                                             
           220,251      L1-icache-load-misses                                         (30.73%)
       286,864,314      dTLB-loads                #  125.203 M/sec                    (30.73%)
             6,314      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.73%)
                29      iTLB-loads                #    0.013 K/sec                    (30.73%)
             6,366      iTLB-load-misses          # 21951.72% of all iTLB cache hits  (30.73%)

       2.291300162 seconds time elapsed

Duration: 4349681

 Performance counter stats for './release/gcc/test huge random':

       4385.282466      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                53      page-faults               #    0.012 K/sec                  
    21,911,541,450      cycles                    #    4.997 GHz                      (30.70%)
     2,175,972,910      instructions              #    0.10  insn per cycle           (38.45%)
       274,356,392      branches                  #   62.563 M/sec                    (38.54%)
           560,941      branch-misses             #    0.20% of all branches          (38.63%)
         7,966,853      L1-dcache-loads           #    1.817 M/sec                    (38.70%)
       292,131,592      L1-dcache-load-misses     # 3666.84% of all L1-dcache hits    (38.65%)
            27,531      LLC-loads                 #    0.006 M/sec                    (30.81%)
            12,413      LLC-load-misses           #   45.09% of all LL-cache hits     (30.72%)
   <not supported>      L1-icache-loads                                             
           353,438      L1-icache-load-misses                                         (30.65%)
         7,252,590      dTLB-loads                #    1.654 M/sec                    (30.65%)
               440      dTLB-load-misses          #    0.01% of all dTLB cache hits   (30.65%)
               274      iTLB-loads                #    0.062 K/sec                    (30.65%)
             9,577      iTLB-load-misses          # 3495.26% of all iTLB cache hits   (30.65%)

       4.385392278 seconds time elapsed

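Run on Ubuntu 18.04.5 LTS with an Intel i9-9900KS (which is not NUMA), 4x8GiB 4GHz CL17 RAM in all 4 slots, the performance governor for no CPU frequency scaling, liquid cooling fans on max for no thermal throttling, FIFO priority 40 for no preemption, pinned to one specific CPU core for no CPU migration, over multiple runs. The results are similar with the clang++-8.0.0 compiler.

It feels like something in the hardware is at play, such as a store buffer per page frame, so that 4KiB pages allow roughly 2x more stores per unit of time.

It would be interesting to see results for AMD Zen 3 CPUs.

On an AMD Ryzen 9 5950X, the huge page version is only about 10% slower: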

Duration: 1578723

 Performance counter stats for './release/gcc/test small random':

          1,726.89 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,947      page-faults               #    0.001 M/sec                  
     8,189,576,204      cycles                    #    4.742 GHz                      (33.02%)
         3,174,036      stalled-cycles-frontend   #    0.04% frontend cycles idle     (33.14%)
            95,950      stalled-cycles-backend    #    0.00% backend cycles idle      (33.25%)
     3,301,760,473      instructions              #    0.40  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (33.37%)
       480,276,481      branches                  #  278.116 M/sec                    (33.49%)
           864,075      branch-misses             #    0.18% of all branches          (33.59%)
       709,483,403      L1-dcache-loads           #  410.844 M/sec                    (33.59%)
     1,608,181,551      L1-dcache-load-misses     #  226.67% of all L1-dcache accesses  (33.59%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             
        78,963,441      L1-icache-loads           #   45.726 M/sec                    (33.59%)
            46,639      L1-icache-load-misses     #    0.06% of all L1-icache accesses  (33.51%)
       301,463,437      dTLB-loads                #  174.570 M/sec                    (33.39%)
       301,698,272      dTLB-load-misses          #  100.08% of all dTLB cache accesses  (33.28%)
                54      iTLB-loads                #    0.031 K/sec                    (33.16%)
             2,774      iTLB-load-misses          # 5137.04% of all iTLB cache accesses  (33.05%)
       243,732,886      L1-dcache-prefetches      #  141.140 M/sec                    (33.01%)
   <not supported>      L1-dcache-prefetch-misses                                   

       1.727052901 seconds time elapsed

       1.579089000 seconds user
       0.147914000 seconds sys

Duration: 1628512

 Performance counter stats for './release/gcc/test huge random':

          1,680.06 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.001 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
             1,947      page-faults               #    0.001 M/sec                  
     8,037,708,678      cycles                    #    4.784 GHz                      (33.34%)
         4,684,831      stalled-cycles-frontend   #    0.06% frontend cycles idle     (33.34%)
         2,445,415      stalled-cycles-backend    #    0.03% backend cycles idle      (33.34%)
     2,217,699,442      instructions              #    0.28  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (33.34%)
       281,522,918      branches                  #  167.567 M/sec                    (33.34%)
           549,427      branch-misses             #    0.20% of all branches          (33.33%)
       312,930,677      L1-dcache-loads           #  186.261 M/sec                    (33.33%)
     1,614,505,314      L1-dcache-load-misses     #  515.93% of all L1-dcache accesses  (33.33%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             
           888,872      L1-icache-loads           #    0.529 M/sec                    (33.33%)
            13,140      L1-icache-load-misses     #    1.48% of all L1-icache accesses  (33.33%)
             9,168      dTLB-loads                #    0.005 M/sec                    (33.33%)
               870      dTLB-load-misses          #    9.49% of all dTLB cache accesses  (33.33%)
             1,173      iTLB-loads                #    0.698 K/sec                    (33.33%)
             1,914      iTLB-load-misses          #  163.17% of all iTLB cache accesses  (33.33%)
       253,307,275      L1-dcache-prefetches      #  150.772 M/sec                    (33.33%)
   <not supported>      L1-dcache-prefetch-misses                                   

       1.680230802 seconds time elapsed

       1.628170000 seconds user
       0.052005000 seconds sys

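【Comments】:

The huge test does have noticeably more iTLB loads and misses, as well as more icache load misses. That seems odd.
@AndrewHenle These outputs are strange indeed. L1-dcache-loads is 6,758,085, but L1-dcache-load-misses is 293,418,903; how should that be interpreted? Shouldn't L1-dcache-loads >= L1-dcache-load-misses hold? Or should the ratio be L1-dcache-loads / (L1-dcache-loads + L1-dcache-load-misses)? perf does not think so: L1-dcache-load-misses/L1-dcache-loads == 4341.75%.
@AndrewHenle I use huge pages in production; they were benchmarked and showed better timings on production workloads on Xeons. But this simple benchmark shows something that is fundamentally misunderstood, or broken, with huge pages, at least on Skylake. I do due diligence when benchmarking, such as booting the kernel at level 3 or s, setting the performance governor to max, setting CPU fans to max, and running multiple times with FIFO real-time priority.
I fully agree with that. I wonder what the actual instruction timings are. I did find this: Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code. Now I wish I had some new hardware to experiment on, even though I do not have your experience doing this kind of analysis on Intel hardware. Everything I have access to right now is pretty ancient.
@AndrewHenle Thanks, but 99% of my profiling experience is looking at every number and applying common sense. The most primitive and widely supported CPU cycle counter gets you very far; there is no need for the latest CPUs with fancy counters. perf record -e cycles:uppp -c 10000 <app> followed by perf report -Mintel shows where the CPU cycles are spent. If loads/stores from/to memory show up as consuming many cycles, the code is bottlenecked on memory access (which is the case 99% of the time). No rocket science: a basic CPU cycle counter is enough to get good insight.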

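【Answer 2】:

Intel was kind enough to reply to this issue. See their answer below.

This issue is due to how physical pages are actually committed. In the case of 1GB pages, the memory is contiguous. So as soon as you write to any one byte within the 1GB page, the entire 1GB page is allocated. With 4KB pages, however, the physical pages are allocated as and when you first touch each of the 4KB pages.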

for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
  for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
    index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
    ptr[index] = index * 5;
  }
}

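In the innermost loop, index changes with a stride of 512K elements (2 MiB in bytes, given 4-byte elements). So consecutive references map at offsets that are multiples of that stride. Typically caches have 2048 sets (i.e. 2^11), so bits 6:16 select the set. But if you stride at such offsets, bits 6:16 stay the same, so you end up selecting the same set and losing spatial locality.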
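
The set-selection argument can be checked with a few lines of arithmetic. This is a minimal sketch under the assumptions stated in the answer (64-byte cache lines, 2048 sets indexed by address bits 6:16); real caches differ in set count and may hash addresses, and the function names are hypothetical. With one 1GB page, the low 30 bits of the virtual and physical address coincide, so the byte offset inside the buffer determines the set.

```c
#include <stddef.h>
#include <stdint.h>

enum { LINE_BITS = 6, NUM_SETS = 2048 }; /* assumed: 64 B lines, 2^11 sets */

/* Set index selected by address bits 6:16 under the assumed geometry. */
unsigned cache_set_index(uint64_t phys_addr) {
  return (unsigned)((phys_addr >> LINE_BITS) & (NUM_SETS - 1));
}

/* Number of distinct sets touched by n accesses at offsets
 * 0, stride_bytes, 2 * stride_bytes, ... */
size_t distinct_sets(uint64_t stride_bytes, size_t n) {
  unsigned char seen[NUM_SETS] = {0};
  size_t count = 0;
  for (size_t k = 0; k < n; k++) {
    unsigned set = cache_set_index((uint64_t)k * stride_bytes);
    if (!seen[set]) {
      seen[set] = 1;
      count++;
    }
  }
  return count;
}
```

Under these assumptions, the 512 accesses of one inner-loop pass land in a single set whether the stride is taken as 512 KiB or as 2 MiB, whereas 512 accesses at a 64-byte stride spread over 512 different sets. With 4KB pages the physical frames are not contiguous, so bits 12:16 of the physical address vary from page to page and the accesses spread out.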

We recommend initializing the entire 1GB buffer sequentially (in the small page test), as below, before starting the clock:

for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
    ptr[i] = i * 5;

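Basically, because of the very large constant offsets, set conflicts cause more cache misses with huge pages than with small pages. When you use constant offsets, the test is really not random.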
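
To address the last point, the constant-stride loop could be replaced by one that draws indices from a pseudorandom generator. The following is only a sketch, not something proposed in the answer: the helper names are hypothetical, xorshift64 is one simple generator among many, and the modulo reduction introduces a slight bias that is irrelevant for this purpose.

```c
#include <stdint.h>

/* Marsaglia xorshift64 step; the state must be nonzero. */
uint64_t xorshift64(uint64_t *state) {
  uint64_t x = *state;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  *state = x;
  return x;
}

/* Perform n_writes stores at pseudorandom element indices in [0, n_elems),
 * writing the same value pattern as the original benchmark (index * 5). */
void random_writes(uint32_t *ptr, uint64_t n_elems, uint64_t n_writes,
                   uint64_t seed) {
  uint64_t state = seed ? seed : 1; /* avoid the all-zero state */
  for (uint64_t w = 0; w < n_writes; w++) {
    uint64_t index = xorshift64(&state) % n_elems;
    ptr[index] = (uint32_t)(index * 5);
  }
}
```

Using the same seed for the small page and huge page runs keeps the access sequence identical between them, so any timing difference comes from the page size rather than from the pattern.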

【Comments】: