Measuring Cache Latencies: Answers

【Question title】: Measuring Cache Latencies
【Posted】: 2014-02-17 14:44:11
【Question】:

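I'm trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes, and I think I understand conceptually how to do it, but I'm running into problems with my implementation. I'm wondering whether other hardware complexities, such as prefetching, are causing the issues.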

#include <time.h>
#include <stdio.h>
#include <string.h>

int main(){
    srand(time(NULL));  // Seed ONCE
    const int L1_CACHE_SIZE =  32768/sizeof(int);
    const int L2_CACHE_SIZE =  262144/sizeof(int);
    const int L3_CACHE_SIZE =  6587392/sizeof(int);
    const int NUM_ACCESSES = 1000000;
    const int SECONDS_PER_NS = 1000000000;
    int arrayAccess[L1_CACHE_SIZE];
    int arrayInvalidateL1[L1_CACHE_SIZE];
    int arrayInvalidateL2[L2_CACHE_SIZE];
    int arrayInvalidateL3[L3_CACHE_SIZE];
    int count=0;
    int index=0;
    int i=0;
    struct timespec startAccess, endAccess;
    double mainMemAccess, L1Access, L2Access, L3Access;
    int readValue=0;

    memset(arrayAccess, 0, L1_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL1, 0, L1_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL2, 0, L2_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL3, 0, L3_CACHE_SIZE*sizeof(int));

    index = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index];               //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
        count++;                                           //divide overall time by this 
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    mainMemAccess = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
    mainMemAccess /= count;

    printf("Main Memory Access %lf\n", mainMemAccess);

    index = 0;
    count=0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index];               //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
        count++;                                           //divide overall time by this 
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock              
    L1Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
    L1Access /= count;

    printf("L1 Cache Access %lf\n", L1Access);

    //invalidate L1 by accessing all elements of array which is larger than cache
    for(count=0; count < L1_CACHE_SIZE; count++){
        int read = arrayInvalidateL1[count]; 
        read++;
        readValue+=read;               
    }

    index = 0;
    count = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index];               //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
        count++;                                           //divide overall time by this 
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    L2Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
    L2Access /= count;

    printf("L2 Cache Acces %lf\n", L2Access);

    //invalidate L2 by accessing all elements of array which is larger than cache
    for(count=0; count < L2_CACHE_SIZE; count++){
        int read = arrayInvalidateL2[count];  
        read++;
        readValue+=read;                        
    }

    index = 0;
    count=0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index];               //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
        count++;                                           //divide overall time by this 
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    L3Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
    L3Access /= count;

    printf("L3 Cache Access %lf\n", L3Access);

    printf("Read Value: %d", readValue);

}

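I start by accessing a value in the array I want to read from. This should clearly come from main memory, since it is the first access. The array is small (less than a page), so it should be copied into L1, L2, and L3. I then access values from the same array, which should now be in L1. Next, I access every value of an array the same size as the L1 cache to invalidate the data I want to access (so it should now be only in L2/L3). I then repeat this process for L2 and L3. The access times are clearly wrong, though, which means I'm doing something wrong...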

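I think the time taken by the timing calls themselves may be a problem (starting and stopping the clock costs some nanoseconds, and that cost changes depending on whether things are cached).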
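To get a feel for how large that timer cost is, one option is to time a run of back-to-back `clock_gettime` calls. The following is only a rough sketch: the helper names `elapsed_ns` and `timer_overhead_ns` are made up for illustration, and `CLOCK_MONOTONIC` is swapped in for the `CLOCK_REALTIME` used in the question so that wall-clock adjustments cannot distort the interval.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Elapsed nanoseconds between two timespec values. */
static int64_t elapsed_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

/* Average cost in ns of one clock_gettime() call, measured over `calls`
 * back-to-back calls (hypothetical helper, not from the question). */
static double timer_overhead_ns(int calls) {
    struct timespec start, end, scratch;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < calls; i++) {
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)elapsed_ns(start, end) / calls;
}
```

Calling `timer_overhead_ns(100000)` from `main` and printing the result shows how many accesses a timed region needs before the timer cost becomes negligible next to a cache hit of a few nanoseconds.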

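Can anyone give me some pointers on what I might be doing wrong?

UPDATE 1: I have amortized the timer cost by making a large number of accesses, fixed the cache sizes, and taken the advice to use a more complex indexing scheme that avoids fixed strides. Unfortunately, the times are still off: they all seem to be coming from L1. I think the problem may lie in the invalidation rather than in the accesses. Would a random versus LRU replacement scheme affect which data gets evicted?

UPDATE 2: Fixed the memset (added an L3 memset to evict the data from L3 as well, so that the first access starts from main memory) and the indexing scheme; still no luck.

UPDATE 3: I couldn't get this approach to work, but there are some good answers with suggestions, and I have posted a couple of solutions of my own.

I also ran Cachegrind to look at the hits and misses: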

 ==6710== I   refs:      1,735,104
==6710== I1  misses:        1,092
==6710== LLi misses:        1,084
==6710== I1  miss rate:      0.06%
==6710== LLi miss rate:      0.06%
==6710== 
==6710== D   refs:      1,250,696  (721,162 rd   + 529,534 wr)
==6710== D1  misses:      116,492  (  7,627 rd   + 108,865 wr)
==6710== LLd misses:      115,102  (  6,414 rd   + 108,688 wr)
==6710== D1  miss rate:       9.3% (    1.0%     +    20.5%  )
==6710== LLd miss rate:       9.2% (    0.8%     +    20.5%  )
==6710== 
==6710== LL refs:         117,584  (  8,719 rd   + 108,865 wr)
==6710== LL misses:       116,186  (  7,498 rd   + 108,688 wr)
==6710== LL miss rate:        3.8% (    0.3%     +    20.5%  )


        Ir I1mr ILmr      Dr  D1mr  DLmr     Dw D1mw DLmw 

      .    .    .       .     .     .      .    .    .  #include <time.h>
      .    .    .       .     .     .      .    .    .  #include <stdio.h>
      .    .    .       .     .     .      .    .    .  #include <string.h>
      .    .    .       .     .     .      .    .    .  
      6    0    0       0     0     0      2    0    0  int main(){
      5    1    1       0     0     0      2    0    0      srand(time(NULL));  // Seed ONCE
      1    0    0       0     0     0      1    0    0      const int L1_CACHE_SIZE =  32768/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int L2_CACHE_SIZE =  262144/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int L3_CACHE_SIZE =  6587392/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int NUM_ACCESSES = 1000000;
      1    0    0       0     0     0      1    0    0      const int SECONDS_PER_NS = 1000000000;
     21    2    2       3     0     0      3    0    0      int arrayAccess[L1_CACHE_SIZE];
     21    1    1       3     0     0      3    0    0      int arrayInvalidateL1[L1_CACHE_SIZE];
     21    2    2       3     0     0      3    0    0      int arrayInvalidateL2[L2_CACHE_SIZE];
     21    1    1       3     0     0      3    0    0      int arrayInvalidateL3[L3_CACHE_SIZE];
      1    0    0       0     0     0      1    0    0      int count=0;
      1    1    1       0     0     0      1    0    0      int index=0;
      1    0    0       0     0     0      1    0    0      int i=0;
      .    .    .       .     .     .      .    .    .      struct timespec startAccess, endAccess;
      .    .    .       .     .     .      .    .    .      double mainMemAccess, L1Access, L2Access, L3Access;
      1    0    0       0     0     0      1    0    0      int readValue=0;
      .    .    .       .     .     .      .    .    .  
      7    0    0       2     0     0      1    1    1      memset(arrayAccess, 0, L1_CACHE_SIZE*sizeof(int));
      7    1    1       2     2     0      1    0    0      memset(arrayInvalidateL1, 0, L1_CACHE_SIZE*sizeof(int));
      7    0    0       2     2     0      1    0    0      memset(arrayInvalidateL2, 0, L2_CACHE_SIZE*sizeof(int));
      7    1    1       2     2     0      1    0    0      memset(arrayInvalidateL3, 0, L3_CACHE_SIZE*sizeof(int));
      .    .    .       .     .     .      .    .    .  
      1    0    0       0     0     0      1    1    1      index = 0;
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    1    1     768   257   257    256    0    0          int tmp = arrayAccess[index];               //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++;                                           //divide overall time by this 
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    1    1       5     1     1      1    1    1      mainMemAccess = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    0    0       2     0     0      1    0    0      mainMemAccess /= count;
      .    .    .       .     .     .      .    .    .  
      6    1    1       2     0     0      2    0    0      printf("Main Memory Access %lf\n", mainMemAccess);
      .    .    .       .     .     .      .    .    .  
      1    0    0       0     0     0      1    0    0      index = 0;
      1    0    0       0     0     0      1    0    0      count=0;
      4    1    1       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   240     0    256    0    0          int tmp = arrayAccess[index];               //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++;                                           //divide overall time by this 
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock              
     14    1    1       5     0     0      1    1    0      L1Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    1    1       2     0     0      1    0    0      L1Access /= count;
      .    .    .       .     .     .      .    .    .  
      6    0    0       2     0     0      2    0    0      printf("L1 Cache Access %lf\n", L1Access);
      .    .    .       .     .     .      .    .    .  
      .    .    .       .     .     .      .    .    .      //invalidate L1 by accessing all elements of array which is larger than cache
 32,773    1    1  24,578     0     0      1    0    0      for(count=0; count < L1_CACHE_SIZE; count++){
 40,960    0    0  24,576   513   513  8,192    0    0          int read = arrayInvalidateL1[count]; 
  8,192    0    0   8,192     0     0      0    0    0          read++;
 16,384    0    0  16,384     0     0      0    0    0          readValue+=read;               
      .    .    .       .     .     .      .    .    .      }
      .    .    .       .     .     .      .    .    .  
      1    0    0       0     0     0      1    0    0      index = 0;
      1    1    1       0     0     0      1    0    0      count = 0;
      4    0    0       0     0     0      1    1    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   256     0    256    0    0          int tmp = arrayAccess[index];               //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++;                                           //divide overall time by this 
      .    .    .       .     .     .      .    .    .      }
      4    1    1       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    0    0       5     1     0      1    1    0      L2Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    1    1       2     0     0      1    0    0      L2Access /= count;
      .    .    .       .     .     .      .    .    .  
      6    0    0       2     0     0      2    0    0      printf("L2 Cache Acces %lf\n", L2Access);
      .    .    .       .     .     .      .    .    .  
      .    .    .       .     .     .      .    .    .      //invalidate L2 by accessing all elements of array which is larger than cache
262,149    2    2 196,610     0     0      1    0    0      for(count=0; count < L2_CACHE_SIZE; count++){
327,680    0    0 196,608 4,097 4,095 65,536    0    0          int read = arrayInvalidateL2[count];  
 65,536    0    0  65,536     0     0      0    0    0          read++;
131,072    0    0 131,072     0     0      0    0    0          readValue+=read;                        
      .    .    .       .     .     .      .    .    .      }
      .    .    .       .     .     .      .    .    .  
      1    0    0       0     0     0      1    0    0      index = 0;
      1    0    0       0     0     0      1    0    0      count=0;
      4    0    0       0     0     0      1    1    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   256     0    256    0    0          int tmp = arrayAccess[index];               //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++;                                           //divide overall time by this 
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    1    1       5     1     0      1    1    0      L3Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    0    0       2     0     0      1    0    0      L3Access /= count;
      .    .    .       .     .     .      .    .    .  
      6    1    1       2     0     0      2    0    0      printf("L3 Cache Access %lf\n", L3Access);
      .    .    .       .     .     .      .    .    .  
      6    0    0       1     0     0      1    0    0      printf("Read Value: %d", readValue);
      .    .    .       .     .     .      .    .    .  
      3    0    0       3     0     0      0    0    0  }

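【Question comments】:

Use rdtsc instead of clock_gettime. See: [Is clock_gettime() adequate for submicrosecond timing?][1] [1]: stackoverflow.com/questions/7935518/…
That shouldn't make much of a difference in the grand scheme of things, since I spread the overhead across a large number of accesses.
The L1 part can be answered from the Intel developer manuals. I'm pretty sure it says there that L1 access performs exactly the same as register access. I'm always amazed by what the hardware prefetcher gets right versus what it hopelessly messes up.
What processor architecture are you using?
PandaRaid, Cachegrind is not the real thing; it is only a cache simulator, and its caches do not exactly match your CPU's real caches and their associativity and miss behavior. Use perf stat to get actual hit/miss totals, and perf record to get some information about which instructions miss.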

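Tags: c arrays performance caching memory

【Answer 1】:

Not really an answer, but read it anyway; some of this has already been mentioned in the other answers and comments here.

I answered this question the other day:

Cache size estimation on your system?

It is about measuring L1/L2/.../L?/MEMORY transfer rates; take a look at it for a better starting point for your problem.

[Notes]

I strongly recommend using the RDTSC instruction for time measurement,

especially for L1, since anything else is too slow. Don't forget to set the process affinity to a single CPU, because every core has its own counter, and their counts can differ a lot even with the same input clock!

On machines with a variable clock, set the CPU clock to its maximum, and don't forget to account for RDTSC overflow if you use only the low 32 bits (a modern CPU overflows a 32-bit counter in about a second). For time computations, use the CPU clock frequency (measure it, or use the registry value):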
```
t0 <- RDTSC
Sleep(250);
t1 <- RDTSC
CPU f=(t1-t0)<<2 [Hz]   // 250 ms is a quarter of a second, so multiply by 4
```
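The pseudocode above could be written in C roughly as follows. This is a sketch under several assumptions: it targets GCC or Clang on x86, where `<x86intrin.h>` provides `__rdtsc()`; the helper names `read_tsc` and `estimate_tsc_hz` are invented here; and it divides by the measured sleep time rather than trusting the requested 250 ms. Note also that on recent CPUs the TSC may tick at a fixed rate that differs from the current core clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */
#endif

/* Read the full 64-bit time-stamp counter, which sidesteps the 32-bit
 * overflow problem; returns 0 on non-x86 targets. */
static uint64_t read_tsc(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return 0;
#endif
}

/* Estimate TSC ticks per second by sleeping for about sleep_ms milliseconds
 * and dividing the tick delta by the measured wall-clock interval. */
static double estimate_tsc_hz(long sleep_ms) {
    struct timespec req = { .tv_sec = sleep_ms / 1000,
                            .tv_nsec = (sleep_ms % 1000) * 1000000L };
    struct timespec w0, w1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    uint64_t t0 = read_tsc();
    nanosleep(&req, NULL);
    uint64_t t1 = read_tsc();
    clock_gettime(CLOCK_MONOTONIC, &w1);
    double seconds = (double)(w1.tv_sec - w0.tv_sec)
                   + (double)(w1.tv_nsec - w0.tv_nsec) / 1e9;
    return (double)(t1 - t0) / seconds;
}
```

A call such as `estimate_tsc_hz(250)` mirrors the 250 ms sleep in the pseudocode; dividing a cycle count by the returned value converts it to seconds.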
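Set the process affinity to a single CPU

All CPU cores usually have their own L1 and L2 caches, so on a multitasking OS you may measure confusing things if you skip this.
Produce graphical output (plots)

Then you can see what is actually happening; I posted plenty of plots in the answer linked above.
Use the highest process priority the OS offers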
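For the affinity step, a Linux-only sketch might look like the following. It assumes glibc's `sched_getaffinity`/`sched_setaffinity` with `_GNU_SOURCE`; the helper name `pin_to_first_allowed_cpu` is made up for illustration, and other operating systems need a different API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* needed for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to run on.
 * Returns that CPU's index, or -1 on failure. Linux-specific. */
static int pin_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Calling this once at the start of `main`, before any timing, keeps the measured loop and its data on one core's private caches. Picking the first allowed CPU rather than hard-coding CPU 0 is a design choice so the sketch also works inside containers or cpusets that exclude CPU 0.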

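【Comments】:

Are you sure the tick counters differ between cores? Nowadays, with dynamic CPU frequency scaling, the TSC is no longer the CPU clock (see stackoverflow.com/a/19942784/196561); it is a uniform, coherent clock that counts from some stable high-frequency signal close to the typical CPU frequency. If we use RDTSC together with the highest actual CPU clock, and that clock is also variable, we will get incorrect cache latency results.
I last saw it on an AMD Phenom X3 with a stable frequency. My conclusion was that it is caused either by differing temperatures (if every core has its own PLL) or by the cores not being started at the same time. I haven't tested it on newer CPUs (I always use affinity 1 for time-measurement threads).

【Answer 2】:

For those interested: I couldn't get my first set of code to work, so I tried a couple of alternative approaches that produced decent results.

The first uses a linked list whose nodes are placed stride bytes apart in a contiguous memory region. Dereferencing the nodes reduces the effectiveness of the prefetcher, and since multiple cache lines may be pulled in at once, I made the stride considerably larger to avoid cache hits. As the size of the allocated list increases, it spills into the next cache level or memory structure that can hold it, showing clear divisions in latency.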

#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>

//MACROS
#define ONE iterate = (char**) *iterate;
#define FIVE ONE ONE ONE ONE ONE
#define TWOFIVE FIVE FIVE FIVE FIVE FIVE
#define HUNDO TWOFIVE TWOFIVE TWOFIVE TWOFIVE

//prototype
void allocateRandomArray(long double);
void accessArray(char *, long double, char**);

int main(){
    //call the function for allocating arrays of increasing size in MB
    allocateRandomArray(.00049);
    allocateRandomArray(.00098);
    allocateRandomArray(.00195);
    allocateRandomArray(.00293);
    allocateRandomArray(.00391);
    allocateRandomArray(.00586);
    allocateRandomArray(.00781);
    allocateRandomArray(.01172);
    allocateRandomArray(.01562);
    allocateRandomArray(.02344);
    allocateRandomArray(.03125);
    allocateRandomArray(.04688);
    allocateRandomArray(.0625);
    allocateRandomArray(.09375);
    allocateRandomArray(.125);
    allocateRandomArray(.1875);
    allocateRandomArray(.25);
    allocateRandomArray(.375);
    allocateRandomArray(.5);
    allocateRandomArray(.75);
    allocateRandomArray(1);
    allocateRandomArray(1.5);
    allocateRandomArray(2);
    allocateRandomArray(3);
    allocateRandomArray(4);
    allocateRandomArray(6);
    allocateRandomArray(8);
    allocateRandomArray(12);
    allocateRandomArray(16);
    allocateRandomArray(24);
    allocateRandomArray(32);
    allocateRandomArray(48);
    allocateRandomArray(64);
    allocateRandomArray(96);
    allocateRandomArray(128);
    allocateRandomArray(192);
}

void allocateRandomArray(long double size){
    int accessSize=(1024*1024*size); //array size in bytes
    char * randomArray = malloc(accessSize*sizeof(char));    //allocate accessSize bytes
    int counter;
    int strideSize=4096;        //step size

    if (randomArray == NULL) {      //allocation failed, skip this size
        return;
    }

    char ** head = (char **) randomArray;   //start of linked list in contiguous memory
    char ** iterate = head;         //iterator for linked list
    //link each node to the one stride bytes ahead, staying inside the allocation
    for(counter=strideSize; counter + (int)sizeof(char *) <= accessSize; counter+=strideSize){
        (*iterate) = &randomArray[counter];
        iterate = (char **) *iterate;          //move to the node just linked
    }
    *iterate = (char *) head;       //set tail to point to head

    accessArray(randomArray, size, head);
    free(randomArray);
}

void accessArray(char *cacheArray, long double size, char** head){
    const long double NUM_ACCESSES = 1000000000/100;    //number of accesses to linked list
    const int SECONDS_PER_NS = 1000000000;      //const for timer
    FILE *fp =  fopen("accessData.txt", "a");   //open file for writing data
    int newIndex=0;
    int counter=0;
    int read=0;
    struct timespec startAccess, endAccess;     //struct for timer
    long double accessTime = 0;
    char ** iterate = head;     //create iterator

    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    for(counter=0; counter < NUM_ACCESSES; counter++){
        HUNDO       //macro substitutes 100 accesses to mitigate loop overhead
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    //calculate the time elapsed in ns per access
    accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (100*NUM_ACCESSES);
    fprintf(fp, "%Lf\t%Lf\n", accessTime, size);  //print results to file
    fclose(fp);  //close file
}

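This produced the most consistent results; running it over a range of array sizes and plotting the corresponding latencies gives a very clear distinction between the different cache sizes present.

The next approach, like the previous one, allocates arrays of increasing size. But instead of using a linked list for the memory accesses, I fill each index with its own number and randomly shuffle the array. I then use these indices to jump around the array at random, which mitigates the effect of the prefetcher. However, it occasionally shows a large deviation in access time when multiple adjacent cache lines are pulled in and happen to be hit.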

#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>

//prototype
void allocateRandomArray(long double);
void accessArray(int *, long int);

int main(){
    srand(time(NULL));  // Seed random function
    int i=0;
    for(i=2; i < 32; i++){
        allocateRandomArray(pow(2, i));         //call latency function on arrays of increasing size
    }


}

void allocateRandomArray(long double size){
    int accessSize = (size) / sizeof(int);
    int * randomArray = malloc(accessSize*sizeof(int));
    int counter;

    for(counter=0; counter < accessSize; counter ++){
        randomArray[counter] = counter; 
    }
    for(counter=0; counter < accessSize; counter ++){
        int i,j;
        int swap;
        i = rand() % accessSize;
        j = rand() % accessSize;
        swap = randomArray[i];
        randomArray[i] = randomArray[j];
        randomArray[j] = swap;
    } 

    accessArray(randomArray, accessSize);
    free(randomArray);
}

void accessArray(int *cacheArray, long int size){
    const long double NUM_ACCESSES = 1000000000;
    const int SECONDS_PER_NS = 1000000000;
    int newIndex=0;
    int counter=0;
    int read=0;
    struct timespec startAccess, endAccess;
    long double accessTime = 0;

    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    for(counter = 0; counter < NUM_ACCESSES; counter++){
        newIndex=cacheArray[newIndex];
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    //calculate the time elapsed in ns per access
    accessTime = (((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec)) / (NUM_ACCESSES);
    printf("Access time: %Lf for size %ld\n", accessTime, size);
}

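Averaged over many trials, this method also produced relatively accurate results. The first option is definitely the better of the two, but this is an alternative approach that also works well.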
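One caveat with the shuffled-index approach: random pairwise swaps produce some permutation, but not necessarily one single cycle, so the chain `newIndex = cacheArray[newIndex]` may loop over only a small part of the array. A possible alternative, which is not what the code above does, is Sattolo's algorithm, which always yields one cycle covering every slot. The sketch below uses hypothetical helper names, assumes `n >= 1`, and keeps `rand()` only for simplicity (its modulo bias affects randomness, not the single-cycle property).

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with a random permutation that forms one single cycle
 * (Sattolo's algorithm), so that i = next[i] visits every slot exactly once
 * before returning to the start. */
static void build_single_cycle(size_t *next, size_t n) {
    for (size_t i = 0; i < n; i++) {
        next[i] = i;
    }
    for (size_t i = n; i > 1; i--) {
        /* Pick j from [0, i-2]: excluding slot i-1 itself is what separates
         * Sattolo's algorithm from a plain Fisher-Yates shuffle. */
        size_t j = (size_t)rand() % (i - 1);
        size_t tmp = next[i - 1];
        next[i - 1] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain from slot 0 and return the number of steps needed to
 * come back to slot 0. */
static size_t cycle_length(const size_t *next) {
    size_t steps = 1;
    for (size_t i = next[0]; i != 0; i = next[i]) {
        steps++;
    }
    return steps;
}
```

After `build_single_cycle(next, n)`, `cycle_length(next)` equals `n`, so a timed chase of `n` loads touches every element once regardless of the random seed.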

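【Comments】:

【Answer 3】:

I would rather try to use the hardware clock as the measure. The rdtsc instruction tells you the current cycle count since the CPU was powered on. It is also better to use asm, to make sure that exactly the same instructions are used in both the measurement and the dry run. Using that and some clever statistics, I wrote this a long time ago: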

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>


int i386_cpuid_caches (size_t * data_caches) {
    int i;
    int num_data_caches = 0;
    for (i = 0; i < 32; i++) {

        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx; 

        eax = 4; // get cache info
        ecx = i; // cache id

        asm (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx 

        // taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
        int cache_type = eax & 0x1F; 

        if (cache_type == 0) // end of valid cache identifiers
            break;

        char * cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;

        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;


        // taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
        // ebx contains 3 integers of 10, 10 and 12 bits respectively
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product
        size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;

        if (cache_type == 1 || cache_type == 3) {
            data_caches[num_data_caches++] = cache_total_size;
        }

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %d\n"
            "- System Coherency Line Size: %d bytes\n"
            "- Physical Line partitions: %d\n"
            "- Ways of associativity: %d\n"
            "- Total Size: %zu bytes (%zu kb)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }

    return num_data_caches;
}

int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        perror("open");
        abort();
    }
    char * random_data = mmap(
          NULL
        , lower_cache_size
        , PROT_READ | PROT_WRITE
        , MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
        , -1
        , 0
        ); // get some random data
    if (random_data == MAP_FAILED) {
        perror("mmap");
        abort();
    }

    size_t i;
    for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
        random_data[i] = 1;
    }


    int64_t random_offset = 0;
    while (attempts--) {
        // use processor clock timer for exact measurement
        random_offset += rand();
        random_offset %= lower_cache_size;
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mov %4, %%al\n\t"  // load data
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            : "m" (random_data[random_offset])
            );
        // printf("%d\n", cycles_used);
        if (cycles_used < max_latency)
            latencies[cycles_used]++;
        else 
            latencies[max_latency - 1]++;
    }

    munmap(random_data, lower_cache_size);

    return 0;
} 

int main() {
    size_t cache_sizes[32];
    int num_data_caches = i386_cpuid_caches(cache_sizes);

    int latencies[0x400];
    memset(latencies, 0, sizeof(latencies));

    int empty_cycles = 0;

    int i;
    int attempts = 1000000;
    for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cyscles
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            :
            );
        if (cycles_used < sizeof(latencies) / sizeof(*latencies))
            latencies[cycles_used]++;
        else 
            latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;

    }

    {
        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                empty_cycles = j;
                fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
                break;
            }
        }
    }

    for (i = 0; i < num_data_caches; i++) {
        test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));

        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
                break;
            }
        }

    }

    return 0;

}

Output on a Core2Duo:

Cache ID 0:
- Level: 1
- Type: Data Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Total Size: 262144 bytes (256 kb)

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Total Size: 3145728 bytes (3072 kb)

Empty counting takes 90 cycles
Cache ID 0 has latency 6 cycles
Cache ID 2 has latency 21 cycles
Cache ID 3 has latency 168 cycles

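【Comments】:

Can you write how you compiled it? I get error: 'asm' operand has impossible constraints
Latencies on Core 2 should be 3 cycles for L1 and 15 cycles for L2; for Nehalem, L1 is 4 cycles, L2 is 11, and L3 is 39 - anandtech.com/show/2542/5 - according to the CPU-Z test; there is a Windows binary of the tool at cpuid.com/medias/files/softwares/misc/latency.zip. For AMD, typical L2 latency is 12-20 cycles - anandtech.com/show/2139/3. A similar test, lat_mem_rd, is included in lmbench: stackoverflow.com/q/19899087/196561
@Leeor I was on vacation, sorry for the late answer. Which compiler are you using, and what is your target system? I can compile it without errors for a generic x86_64 target with clang 5.0, gcc 4.8, and icc 14.0.1. Try updating your compiler.
gcc 4.8.0 gives: error: 'asm' operand has impossible constraints. icc 13.1.3 (I don't have 14) gives: catastrophic error: can't allocate registers for asm instruction
This segfaults for me. I found that I needed to replace "=a", "=d", and "=r" in the asm blocks with "=&a", "=&d", and "=&r" to get a correct compile. The ampersand tells gcc not to assume it can reuse an output register for an input; the output may be modified before all inputs have been read.

【Answer 4】:

The widely used, classic test for cache latency is iterating over a linked list. It works on modern superscalar/superpipelined CPUs, even on out-of-order cores such as ARM Cortex-A9+ and Intel Core 2/ix. This method is used by the open-source lmbench, in its lat_mem_rd test (man page), and by the CPU-Z latency measurement tool: http://cpuid.com/medias/files/softwares/misc/latency.zip (native Windows binary)

The source of the lat_mem_rd test from lmbench is here: https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c

The main test is: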

#define ONE p = (char **)*p;
#define FIVE    ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY   TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY

void
benchmark_loads(iter_t iterations, void *cookie)
{
    struct mem_state* state = (struct mem_state*)cookie;
    register char **p = (char**)state->p[0];
    register size_t i;
    register size_t count = state->len / (state->line * 100) + 1;

    while (iterations-- > 0) {
        for (i = 0; i < count; ++i) {
            HUNDRED;
        }
    }

    use_pointer((void *)p);
    state->p[0] = (char*)p;
}

So, after expanding the macros, we perform a long straight-line sequence of operations like:

 p = (char**) *p;  // (in intel syntax) == mov eax, [eax]
 p = (char**) *p;
 p = (char**) *p;
 ....   // 100 times total
 p = (char**) *p;

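over memory that is filled with pointers, each pointing one stride forward.

As the man page http://www.bitmover.com/lmbench/lat_mem_rd.8.html says:

The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by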

 p = (char **)*p;

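in a for loop (the overhead of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads. The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.

A more detailed description, with examples on POWER, is available from IBM's wiki: Untangling memory access measurements - lat_mem_rd - Jenifer Hopper, 2013

The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse the array, using the stride as the increment, by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (a floating-point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency.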
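The ring of pointers described in these quotes can be sketched in a few lines of C. This is not the lmbench source; `build_ring` and `chase` are hypothetical names, and the sketch assumes the buffer comes from `malloc` (so it is pointer-aligned), that the stride is a multiple of `sizeof(char *)`, and that the buffer holds at least one pointer.

```c
#include <stddef.h>

/* Link buf[0], buf[stride], buf[2*stride], ... into a ring: each node stores
 * the address of the node one stride ahead, and the last node points back to
 * the start. Returns the number of nodes in the ring. */
static size_t build_ring(char *buf, size_t len, size_t stride) {
    size_t nodes = 1;                 /* the head node at offset 0 */
    char **node = (char **)buf;
    for (size_t off = stride; off + sizeof(char *) <= len; off += stride) {
        *node = buf + off;            /* point one stride forward */
        node = (char **)*node;
        nodes++;
    }
    *node = buf;                      /* close the ring */
    return nodes;
}

/* Perform `loads` dependent loads: each address comes from the previous
 * load, so the loads cannot overlap. Returns the final position. */
static char **chase(char **p, size_t loads) {
    while (loads-- > 0) {
        p = (char **)*p;
    }
    return p;
}
```

Timing `chase((char **)buf, n)` for a large `n` and dividing by `n` gives the per-load latency for that buffer size; the returned pointer should be used (for example printed) so the compiler cannot drop the loop.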

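PS: There is a paper from Intel (thanks to Eldar Abusalimov) with examples of running lat_mem_rd: ftp://download.intel.com/design/intarch/PAPERS/321074.pdf - sorry, the correct URL is http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-cache-latency-bandwidth-paper.pdf "Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel Architecture" by Joshua Ruggiero, December 2008.

【Comments】:

New link to the latest pdf: element14.com/community/servlet/JiveServlet/previewBody/… - "Measuring Cache and Memory Latency and CPU to Memory Bandwidth" - "For use with Intel® Architecture" - 2008
New link to the previous pdf: csit-sun.pub.ro/~cpop/Documentatie_SMP/…
Hi, I would like to know the time needed for a store to main memory (one that misses in all caches). Do you think it is equal to the time needed for a load from main memory? The latter is reported by the lat_mem_rd program, so I already know it.
blaze9, yes, the time to store to memory should be close (but not always equal) to the time to read from memory. It may be somewhat longer because of the write policy in use (people.cs.pitt.edu/~xianeizhang/notes/cache.html#cache-write en.wikipedia.org/wiki/Cache_(computing)#WRITEPOLICIES), while full cache-line writes are independent and may be faster through parallelization. Because of how DRAM works, RAM has a latency of tens of CPU clocks plus 50-100 ns - 7-cpu.com/cpu/Haswell.html or 7-cpu.com/cpu/Skylake.html. You can ask a new question with more details.

【Answer 5】:

OK, there are several problems with your code:

As you mentioned, your measurement itself takes a long time. In fact, it most likely takes far longer than a single access, so it isn't measuring anything useful. To mitigate that, access multiple elements and amortize (divide the overall time by the number of accesses). Note that to measure latency you want these accesses to be serialized; otherwise they can be performed in parallel, and you will only measure the throughput of unrelated accesses. To achieve that, you can add a false dependency between the accesses.

For example, initialize the array to zeros, and do: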
```
clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
for (int i = 0; i < NUM_ACCESSES; ++i) {
    int tmp = arrayAccess[index];                             //Access Value from Main Memory
    index = (index + i + tmp) & 1023;   
}
clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
```
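.. and of course remember to divide the time by NUM_ACCESSES.
Now, I've made the index intentionally complicated so that you avoid a fixed stride that might trigger a prefetcher (a bit of an overkill; you're unlikely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you strides of 128 bytes (two cache lines) and avoid the "benefit" of most simple adjacent-line/simple-stream prefetchers. I've also replaced the % 1000 with & 1023, since & is faster, but it needs a power of 2 to work the same way, so just increase ACCESS_SIZE to 1024 and it should work.
Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system, but 256000 seems pretty big for an L1. An L2 is usually 256k on many common modern x86 CPUs, for example. Also note that 256k is not 256000 but rather 256*1024=262144. The same goes for the second size: 1M is not 1024000, it's 1024*1024=1048576. Assuming that's indeed your L2 size (more likely an L3, but probably too small for that).
Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 bytes, depending on the system). You're actually invalidating L1_CACHE_SIZE*sizeof(int) worth of bytes (and the same goes for the L2 invalidation loop).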
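The point about & 1023 versus % 1000 can be checked with a small sketch. The helper names below are made up for illustration; the idea is only that masking with n-1 equals the remainder when n is a power of two, and silently gives different values otherwise.

```c
#include <stddef.h>

/* True if n is a nonzero power of two. */
static int is_power_of_two(size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Wrap index into [0, n). Uses the cheap mask when n is a power of two
 * and falls back to % otherwise (n must be nonzero). */
static size_t wrap_index(size_t index, size_t n) {
    return is_power_of_two(n) ? (index & (n - 1)) : (index % n);
}
```

For instance, `1500 & 1023` and `1500 % 1024` agree, while `1500 & 999` does not equal `1500 % 1000`, which is why the array size has to be rounded up to 1024 before switching to the mask.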

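Update:

memset receives the size in bytes; your sizes are divided by sizeof(int).
Your invalidation reads are never used, and may be optimized out. Try to accumulate the reads into some value and print it at the end to avoid that possibility.
The memset at the beginning is accessing the data as well, so your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from L1+L2, although only partially due to the size error).
The strides may be too small, so you get two accesses to the same cache line (an L1 hit). Make sure they are spread out enough by adding 32 elements (x4 bytes); that's 2 cache lines, so you also won't get any adjacent-cache-line prefetch benefits.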
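The size and dead-read points in this list can be illustrated with a short sketch. The names are hypothetical, a 64-byte cache line is an assumption that should be checked for the actual CPU, and whether one pass really evicts everything also depends on associativity and the replacement policy, so the buffer should be comfortably larger than the cache being flushed.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumed line size; check your CPU */

/* Number of int elements needed to cover cache_bytes bytes. Pass the byte
 * count to memset and this element count to indexing loops. */
static size_t elements_for_bytes(size_t cache_bytes) {
    return cache_bytes / sizeof(int);
}

/* Read one int per cache line across the buffer and return the sum, so the
 * compiler cannot discard the loads as dead code. */
static long touch_every_line(const volatile int *buf, size_t elements) {
    long sum = 0;
    size_t step = CACHE_LINE_BYTES / sizeof(int);
    for (size_t i = 0; i < elements; i += step) {
        sum += buf[i];
    }
    return sum;
}
```

Printing the value returned by `touch_every_line` at the end of the program plays the same role as the readValue accumulator suggested above.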

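Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the average time shifts in favor of the L1 access latency). Instead, try using the L1 size so that you access the entire L1 (except for the skips) exactly once. For example, like this: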

index = 0;
while (index < L1_CACHE_SIZE) {
    int tmp = arrayAccess[index];               //Access Value from L2
    index = (index + tmp + ((index & 4) ? 28 : 36));   // on average this should give 32 element skips, with changing strides
    count++;                                           //divide overall time by this 
}

Don't forget to increase arrayAccess to the L1 size.

Now, with the changes above (more or less), I get something like this:

L1 Cache Access 7.812500
L2 Cache Acces 15.625000
L3 Cache Access 23.437500

This still seems a bit long, but that may be because it includes an additional dependency on arithmetic operations.
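To compare nanosecond figures like these with latencies quoted in cycles, multiply by the core frequency in GHz. The helper below is a trivial sketch with a made-up name; the clock frequency of the machine that produced the numbers above is not stated, so any frequency plugged in is an assumption.

```c
/* Convert a latency in nanoseconds to core clock cycles at the given
 * frequency in GHz (1 GHz means 1 cycle per nanosecond). */
static double ns_to_cycles(double ns, double ghz) {
    return ns * ghz;
}
```

At a hypothetical 2.0 GHz, `ns_to_cycles(7.8125, 2.0)` gives 15.625 cycles for the L1 figure, well above the few cycles usually quoted for an L1 hit, which fits the remark that extra dependent arithmetic is being included.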

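【Comments】:

Great insight, I'll definitely look into some of the points you mentioned. As for my cache sizes: yes, my L1 is 256k (non-unified), the L2 is a unified 1024k, and the L3 is a unified 6433k.
@PandaRaid, which system is that?
An Extreme i7. I could be wrong, since I haven't read the actual specs on Intel's site, but these are the figures I got from the "dmidecode -t cache" command.
Strange, I thought the i7's L1/L2 were no different from the mainstream parts, and that only the L3 could be scaled for high-end/low-end SKUs. I assume you have Linux; what does /proc/cpuinfo say?
The cache size in cpuinfo seems to report only the L3 size, which matches the output of dmidecode. I agree that the L1/L2 look rather large (especially the L1, since that comes to 512k between the data and instruction caches).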