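Measuring memory access time on x86 (answer)

【Question title】: Measuring memory access time x86
【Posted】: 2018-06-22 22:21:50
【Question】:

I tried to measure cached vs. uncached memory access times, and the results confuse me.

The code is as follows: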

#include <stdio.h>
#include <string.h>   /* memset */
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
            :
            :
            : "memory");

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
                "mov %0, %%al\n\t"  // load data
                :
                : "m" (arr[i])
                );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", tsc2 - tsc1);

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
                "mov %0, %%al\n\t"  // load data
                :
                : "m" (arr[i])
                );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", tsc2 - tsc1);

    return 0;
}

Output:

(1) tsc: 451248
(2) tsc: 449568

I expected the first value to be much larger, because in case (1) the cache lines had been invalidated by clflush.

Cache information for my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz):

Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true

Disassembly of the code between the two rdtscp instructions:

  400614:       0f 01 f9                rdtscp 
  400617:       89 ce                   mov    %ecx,%esi
  400619:       48 8b 4d d8             mov    -0x28(%rbp),%rcx
  40061d:       89 31                   mov    %esi,(%rcx)
  40061f:       48 c1 e2 20             shl    $0x20,%rdx
  400623:       48 09 d0                or     %rdx,%rax
  400626:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  40062a:       c7 45 b4 00 00 00 00    movl   $0x0,-0x4c(%rbp)
  400631:       eb 0d                   jmp    400640 <main+0x8a>
  400633:       8b 45 b4                mov    -0x4c(%rbp),%eax
  400636:       8a 80 80 10 60 00       mov    0x601080(%rax),%al
  40063c:       83 45 b4 01             addl   $0x1,-0x4c(%rbp)
  400640:       81 7d b4 ff 7f 00 00    cmpl   $0x7fff,-0x4c(%rbp)
  400647:       76 ea                   jbe    400633 <main+0x7d>
  400649:       48 8d 45 b0             lea    -0x50(%rbp),%rax
  40064d:       48 89 45 e0             mov    %rax,-0x20(%rbp)
  400651:       0f 01 f9                rdtscp

It looks like I'm missing or misunderstanding something. Can you suggest what?

【Comments on the question】:

Tags: performance caching assembly memory x86

【Solution 1】:

mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically, not Haswell or later) that you may bottleneck on it regardless of whether your loads ultimately come from DRAM or from L1D.

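Only 1 in every 64 loads will miss in cache, because your tiny byte-load loop takes full advantage of spatial locality. If you actually want to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line.)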

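To avoid a false dependency on the old value of RAX, you should use movzbl %0, %eax. This lets Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, and 16 superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)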

But your bottleneck is far worse than that!

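You compiled with optimization disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you a loop-carried dependency chain of at least 6 cycles (store/reload store-forwarding latency, plus 1 cycle for the ALU add). See http://agner.org/optimize/

(Possibly even higher because of resource conflicts for the load port. The i7-720 is a Nehalem microarchitecture, so there is only one load port.)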

This definitely means your loop does not bottleneck on cache misses, and it will probably run at about the same speed whether or not you used clflush.

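Also note that rdtsc counts reference cycles, not core clock cycles. That is, it always counts at 1.6GHz on your 1.6GHz CPU, regardless of whether the CPU is running slower (powersave) or faster (Turbo). Control for this with a warm-up loop.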

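You also did not declare a clobber on eax, so the compiler does not expect your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory on every iteration and does not use the rax that you modified, so you are not actually skipping around in memory the way you might if you compiled with optimization.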

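Hint: use volatile char * if you want the compiler to actually perform each load rather than optimizing them into fewer, wider loads. You don't need inline asm for this.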

【Comments】: