【Posted】: 2019-02-04 14:22:41
【Question】:
I wrote the following code, which first flushes two array elements and then reads them back to measure the hit/miss latency.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache
    uint64_t t1, t2, ov, diff1, diff2, diff3;
    /* flush the cache lines holding array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();
    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];  // read the first element => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );
    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second element => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );
    /* READ HIT */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];      // read the first element again => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d\ndiff3 is %lu\n", tmp, diff3 );
    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );
    return 0;
}
The output is:
# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12
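There are some problems with the output for reading array[70]. Its TSC value looks like neither a hit nor a miss. I flushed that element in the same way as array[30]. One possibility is that when array[40] is accessed, the hardware prefetcher brings in array[70], so the read should be a hit. However, the TSC value is much larger than a hit. You can verify that the hit TSC is about 20 when I read array[30] for the second time.
Even if array[70] was not prefetched, the TSC should be similar to that of a cache miss.
Is there any reason for this?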
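To reason about the prefetching hypothesis, it can help to see which cache lines the two elements fall into. The sketch below is only an illustration: `line_of_element` and `CACHE_LINE_BYTES` are hypothetical names, the 64-byte line size is an assumption (typical for x86, not confirmed by the question), and the array is assumed to start on a line boundary, which a stack array like the one above does not guarantee.

```c
#include <stddef.h>

/* Assumed cache line size in bytes; 64 is typical for x86 but not guaranteed. */
#define CACHE_LINE_BYTES 64

/* Index of the cache line that holds element `index`, counted from the line
 * holding element 0. Assumes the array itself starts on a line boundary. */
static size_t line_of_element(size_t index, size_t elem_size)
{
    return (index * elem_size) / CACHE_LINE_BYTES;
}
```

Under these assumptions and with 4-byte ints, array[30] sits in line 1 and array[70] in line 4, so the two flushed elements are not in the same or adjacent lines.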
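Update 1:
To force a read of the array, I tried (void) *((int*)array+i) as suggested by Peter and Hadi.
In the output, I see many negative results. I mean the measured overhead seems to be larger than the time taken by (void) *((int*)array+i).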
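The question does not show how the negative values were printed. As a small sketch of the arithmetic only: with `uint64_t` operands, `diff - ov` wraps around to a huge number whenever the overhead estimate exceeds the measured interval, so a signed helper makes that case visible. The name `net_cycles` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: measured interval minus estimated overhead, as a
 * signed value. Branching on the comparison keeps both subtractions from
 * wrapping around. */
static int64_t net_cycles(uint64_t measured, uint64_t overhead)
{
    return measured >= overhead ? (int64_t)(measured - overhead)
                                : -(int64_t)(overhead - measured);
}
```

For example, `net_cycles(46, 32)` gives 14, matching the "cache hit" line in the output above, while `net_cycles(20, 32)` gives -12 instead of a wrapped unsigned value.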
Update 2:
I forgot to add volatile. The results make sense now.
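The question does not show the corrected code, so the following is only a guess at what a `volatile` read could look like, reusing the same fence and `__rdtsc()` sequence as the original program. The helper name `timed_load` is hypothetical, and the sketch assumes GCC or Clang on x86-64.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical helper: time a single load from *p in TSC ticks.
 * The returned interval still includes the fence overhead. */
static uint64_t timed_load(const volatile int *p, int *value)
{
    uint64_t t1, t2;

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    *value = *p;    /* volatile access: the compiler has to emit this load */
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    return t2 - t1;
}
```

Calling `timed_load(&array[30], &tmp)` right after `_mm_clflush(&array[30])` would correspond to the "READ MISS 1" step, and calling it a second time would correspond to "READ HIT".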
【Comments】:
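- The compiler probably doesn't bother reading from the array, because it isn't volatile and the value isn't used (the optimizer would/should ignore it completely). Also, the cost of lfence depends on the surrounding code (for example, how many loads are in flight at the time), so it can't be measured under one set of conditions and assumed to be the same under a different set.
- Yes. I forgot to add volatile. Thanks.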
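Since the comment warns that a separately measured lfence overhead may not carry over, one alternative (not proposed in the thread, just a common approach) is to time the hit and miss cases with the identical instruction sequence, repeat many times, and compare the minimum intervals directly without subtracting anything. The name `min_load_ticks` and its parameters are hypothetical; GCC or Clang on x86-64 is assumed.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical helper: smallest TSC interval observed for one load from *p
 * over `trials` runs. If flush_first is nonzero, the line is evicted with
 * clflush before every run. Returns UINT64_MAX if trials <= 0. */
static uint64_t min_load_ticks(volatile int *p, int flush_first, int trials,
                               int *value)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < trials; i++) {
        if (flush_first)
            _mm_clflush((const void *)p);
        _mm_mfence();               /* order the flush before the timed load */
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        *value = *p;                /* volatile load, cannot be optimized away */
        _mm_lfence();
        uint64_t t2 = __rdtsc();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}
```

Comparing `min_load_ticks(&array[70], 1, 1000, &tmp)` with `min_load_ticks(&array[70], 0, 1000, &tmp)` gives a relative miss/hit figure in which both numbers carry the same fence cost. The absolute values still include that cost.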
Tags: c performance x86 cpu-architecture tsc