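【Posted】: 2013-10-10 22:31:37
【Question】:
Hello, I am trying to run a program that finds the closest pair of points using brute force combined with a cache-blocking technique (as in the PDF here): Caching Performance Stanford
My original code is: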
    #include <float.h>  /* FLT_MAX */
    #include <math.h>   /* sqrt */

    float compare_points_BF(int N, point *P){
        int i, j;
        float distance = 0, min_dist = FLT_MAX;
        point *p1, *p2;  /* the closest pair found so far */
        for (i = 0; i < (N-1); i++){
            for (j = i+1; j < N; j++){
                /* squared distance; the square root is taken once at the end */
                if ((distance = (P[i].x - P[j].x) * (P[i].x - P[j].x) +
                                (P[i].y - P[j].y) * (P[i].y - P[j].y)) < min_dist){
                    min_dist = distance;
                    p1 = &P[i];
                    p2 = &P[j];
                }
            }
        }
        return sqrt(min_dist);
    }
This program gives approximately these run times:
    N        8192   16384  32768  65536  131072  262144  524288   1048576
    seconds  0.070  0.280  1.130  5.540  18.080  72.838  295.660  1220.576
             0.080  0.330  1.280  5.190  20.290  80.880  326.460  1318.631
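The timings roughly quadruple each time N doubles (for example 0.070 s, 0.280 s, 1.130 s), which is what a loop over all pairs would predict. A small sketch of that count follows; the helper name `pair_count` is hypothetical and not part of the original program.

```c
#include <assert.h>

/* Number of distance evaluations performed by the brute-force double loop:
   every unordered pair (i, j) with i < j is visited exactly once. */
unsigned long long pair_count(unsigned long long n)
{
    assert(n >= 1);
    return n * (n - 1) / 2;
}
```

For N = 8192 this gives 33550336 comparisons and for N = 16384 it gives 134209536, a ratio of almost exactly 4.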
The cache-blocked version of the above program is:
    float compare_points_BF(register int N, register int B, point *P){
        register int i, j, ib, jb, num_blocks = (N + (B-1)) / B;
        register point *p1, *p2;
        register float distance=0, min_dist=FLT_MAX, regx, regy;
        // Split the array into ceil(N/B) blocks: ib indexes points in the cached i block,
        // jb indexes points in the moving (strided) j block.
        // Each i block is compared with every j block at or after it.
        for (i = 0; i < num_blocks; i++){
            for (j = i; j < num_blocks; j++){
                // read the moving block to compare with the cached i block
                for (jb = j * B; jb < ( ((j+1)*B) < N ? ((j+1)*B) : N); jb++){
                    // keep the current j point in registers
                    regx = P[jb].x;
                    regy = P[jb].y;
                    // when i block == j block, start at jb + 1 to avoid redundant comparisons
                    for (i == j ? (ib = jb + 1) : (ib = i * B); ib < ( ((i+1)*B) < N ? ((i+1)*B) : N); ib++){
                        // squared distance between the current points
                        if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                                       (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                            min_dist = distance;
                            p1 = &P[ib];
                            p2 = &P[jb];
                        }
                    }
                }
            }
        }
        return sqrt(min_dist);
    }
And some results:
Block_size = 256 N = 8192 Run time: 0.090 sec
Block_size = 512 N = 8192 Run time: 0.090 sec
Block_size = 1024 N = 8192 Run time: 0.090 sec
Block_size = 2048 N = 8192 Run time: 0.100 sec
Block_size = 4096 N = 8192 Run time: 0.090 sec
Block_size = 8192 N = 8192 Run time: 0.090 sec
Block_size = 256 N = 16384 Run time: 0.357 sec
Block_size = 512 N = 16384 Run time: 0.353 sec
Block_size = 1024 N = 16384 Run time: 0.360 sec
Block_size = 2048 N = 16384 Run time: 0.360 sec
Block_size = 4096 N = 16384 Run time: 0.370 sec
Block_size = 8192 N = 16384 Run time: 0.350 sec
Block_size = 16384 N = 16384 Run time: 0.350 sec
Block_size = 128 N = 32768 Run time: 1.420 sec
Block_size = 256 N = 32768 Run time: 1.420 sec
Block_size = 512 N = 32768 Run time: 1.390 sec
Block_size = 1024 N = 32768 Run time: 1.410 sec
Block_size = 2048 N = 32768 Run time: 1.430 sec
Block_size = 4096 N = 32768 Run time: 1.430 sec
Block_size = 8192 N = 32768 Run time: 1.400 sec
Block_size = 16384 N = 32768 Run time: 1.380 sec
Block_size = 256 N = 65536 Run time: 5.760 sec
Block_size = 512 N = 65536 Run time: 5.790 sec
Block_size = 1024 N = 65536 Run time: 5.720 sec
Block_size = 2048 N = 65536 Run time: 5.720 sec
Block_size = 4096 N = 65536 Run time: 5.720 sec
Block_size = 8192 N = 65536 Run time: 5.530 sec
Block_size = 16384 N = 65536 Run time: 5.550 sec
Block_size = 256 N = 131072 Run time: 22.750 sec
Block_size = 512 N = 131072 Run time: 23.130 sec
Block_size = 1024 N = 131072 Run time: 22.810 sec
Block_size = 2048 N = 131072 Run time: 22.690 sec
Block_size = 4096 N = 131072 Run time: 22.710 sec
Block_size = 8192 N = 131072 Run time: 21.970 sec
Block_size = 16384 N = 131072 Run time: 22.010 sec
Block_size = 256 N = 262144 Run time: 90.220 sec
Block_size = 512 N = 262144 Run time: 92.140 sec
Block_size = 1024 N = 262144 Run time: 91.181 sec
Block_size = 2048 N = 262144 Run time: 90.681 sec
Block_size = 4096 N = 262144 Run time: 90.760 sec
Block_size = 8192 N = 262144 Run time: 87.660 sec
Block_size = 16384 N = 262144 Run time: 87.760 sec
Block_size = 256 N = 524288 Run time: 361.151 sec
Block_size = 512 N = 524288 Run time: 379.521 sec
Block_size = 1024 N = 524288 Run time: 379.801 sec
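As we can see, the run times are slower than those of the non-blocked code. Is this due to compiler optimization? Is the code bad, or is it simply that the algorithm does not benefit from tiling? I am using VS 2010 and compiling a 32-bit executable. Thanks in advance!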
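When judging block sizes it can help to work out how many bytes a tile occupies. The sketch below does that arithmetic under two assumptions that are not confirmed by the question: that `point` is a struct of two floats, and that two tiles (the i block and the j block) should fit in the cache at once. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: the question never shows the definition of `point`. */
typedef struct { float x, y; } point;

/* Bytes occupied by one block of `block_size` points. */
size_t block_bytes(size_t block_size)
{
    return block_size * sizeof(point);
}

/* Largest block size for which two tiles (i block and j block)
   fit together in a cache of `cache_bytes` bytes. */
size_t max_block_for_cache(size_t cache_bytes)
{
    assert(cache_bytes >= 2 * sizeof(point));
    return cache_bytes / (2 * sizeof(point));
}
```

With an 8-byte `point` and the 32 KB L1 data cache the asker reports in the comments, this gives 2048 points per tile. Under the same assumption, even N = 1048576 occupies 8 MiB, which is below the reported 12 MB L3; that may be one reason tiling shows no gain here, though this is a guess rather than a measurement.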
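【Comments】:
-
How many registers do you think your CPU has?
-
@SJuan76 My CPU is an i7 980x. It has 32KB L1 data and 32KB L1 instruction cache per core; L2 cache: 256KB per core, inclusive; L3 cache: 12MB accessible by all cores, inclusive. In case you are wondering, I did try reducing the block size and the number of variables declared with the register keyword in the code, but performance was still no better. Could this be due to register spilling?
-
register has no effect on modern compilers. The compiler knows better than you do and will mostly ignore it.
-
You are taking a paper written somewhere between the moon landing and today, with specific optimizations that were valid at the time, and trying to apply them to a modern architecture. I am not saying none of it applies, but I would be careful. As for specific advice? I have a hard time decoding your code, but the inner loop with all its odd comparisons and branches does not look healthy. I hope you know that your compiler does not actually compile the code while you run it, so cramming everything onto one line is not faster than properly formatted, split-up code.
-
Ulrich Drepper on memory is about six years old, but still relevant to modern commodity CPUs.
Tags: c++ c caching visual-c++ tiling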