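You are Carl Friedrich Gauss in the 18th century, and your grade school teacher has decided to punish the class with a homework problem that requires a lot of mundane, repetitive arithmetic. The previous week your teacher told you to add up the first 100 counting numbers, and because you are clever you came up with a quick solution. Your teacher did not like this, so he has come up with a new problem which he thinks cannot be optimized. In your own notation you rewrite the problem like this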
a[0] = b[0];
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
Then you realize that
a0 = b[0]
a1 = (b[0]) + b[1]
a2 = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... + b[n]
Using your notation again, you rewrite the problem as
int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
How can this be optimized? First you define
int sum(int n0, int n) {
    int s = 0;
    for (int i = n0; i < n; i++) s += b[i], a[i] = s;
    return s;
}
Then you realize that, because sum(n0, n) covers the indices n0 through n-1,
a_n = sum(0, n) + sum(n, n+1)
a_(n+1) = sum(0, n) + sum(n, n+2)
a_(n+m-1) = sum(0, n) + sum(n, n+m)
a_(n+m+k-1) = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
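The splitting identity above can be checked numerically. The following is a minimal sketch in C, assuming the half-open convention (a range covers indices n0 through n-1) and that a[i] includes b[i], so the combined total of the ranges [0, n) and [n, n+m) lands at index n+m-1. The names range_sum, prefix_sum_serial and split_matches are hypothetical helpers, and the arrays are passed as parameters rather than used as globals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum of b[n0] .. b[n-1] (half-open range). */
int range_sum(const int *b, size_t n0, size_t n) {
    assert(n0 <= n);
    int s = 0;
    for (size_t i = n0; i < n; i++) s += b[i];
    return s;
}

/* Serial running total: a[i] = b[0] + ... + b[i]. */
void prefix_sum_serial(const int *b, int *a, size_t size) {
    int s = 0;
    for (size_t i = 0; i < size; i++) {
        s += b[i];
        a[i] = s;
    }
}

/* Returns 1 if the totals of [0, n) and [n, n+m) add up to a[n+m-1].
   Assumes m >= 1 and that n+m does not exceed the array length. */
int split_matches(const int *b, const int *a, size_t n, size_t m) {
    assert(m >= 1);
    return a[n + m - 1] == range_sum(b, 0, n) + range_sum(b, n, n + m);
}
```

Because the two range totals do not depend on each other, they can be computed by different people at the same time, which is the property the rest of the story relies on.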
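So now you know what to do. Find t classmates and have each one work on a subset of the numbers. To keep it simple you pick size to be 100 and four classmates t0, t1, t2, t3. Then each one does
t0               t1                t2                t3
s0 = sum(0,25)   s1 = sum(25,50)   s2 = sum(50,75)   s3 = sum(75,100)
at the same time. Then define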
void fix(int n0, int n, int offset) {
    for (int i = n0; i < n; i++) a[i] += offset;
}
Then each classmate goes back over their subset, again at the same time, like this
t0              t1                t2                   t3
fix(0, 25, 0)   fix(25, 50, s0)   fix(50, 75, s0+s1)   fix(75, 100, s0+s1+s2)
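The two passes can be simulated serially to confirm that the corrected chunks agree with the plain running total. This is only a sketch under stated assumptions: the classmates are simulated one after another rather than in parallel, size is assumed to divide evenly by t, and the names chunk_scan, chunk_fix, two_pass_scan and the MAX_WORKERS limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_WORKERS = 64 }; /* illustrative upper limit, an assumption */

/* Pass 1 for one classmate: running total inside [n0, n); returns the chunk total. */
int chunk_scan(const int *b, int *a, size_t n0, size_t n) {
    int s = 0;
    for (size_t i = n0; i < n; i++) {
        s += b[i];
        a[i] = s;
    }
    return s;
}

/* Pass 2 for one classmate: add the combined total of all earlier chunks. */
void chunk_fix(int *a, size_t n0, size_t n, int offset) {
    for (size_t i = n0; i < n; i++) a[i] += offset;
}

/* Serial simulation of both passes with t equal-sized chunks. */
void two_pass_scan(const int *b, int *a, size_t size, size_t t) {
    assert(t > 0 && t <= MAX_WORKERS && size % t == 0);
    int s[MAX_WORKERS];
    size_t chunk = size / t;
    for (size_t k = 0; k < t; k++)
        s[k] = chunk_scan(b, a, k * chunk, (k + 1) * chunk);
    int offset = 0; /* s0 + s1 + ... + s(k-1) for classmate k */
    for (size_t k = 0; k < t; k++) {
        chunk_fix(a, k * chunk, (k + 1) * chunk, offset);
        offset += s[k];
    }
}
```

With b holding the counting numbers 1 to 100 and t set to 4, the last element of a comes out as 5050, the total from the previous week's problem.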
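You realize that with t classmates, each taking about the same K seconds per arithmetic operation, you can finish the job in 2*K*size/t seconds, whereas one person would take K*size seconds. Clearly you need at least two classmates just to break even. So with four classmates they should finish in half the time one classmate would take.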
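The estimate can be written out as a small cost model. This is a rough sketch that assumes every operation costs the same K seconds and ignores the time spent adding up the earlier classmates' totals and waiting for each other; the function names are hypothetical.

```c
#include <assert.h>

/* Time for one person: one operation per element. */
double serial_seconds(double K, double size) {
    return K * size;
}

/* Time for t classmates: two passes (scan, then fix) over size/t elements each. */
double two_pass_seconds(double K, double size, double t) {
    assert(t >= 1.0);
    return 2.0 * K * size / t;
}

/* Estimated speedup over one person; algebraically this is t / 2. */
double estimated_speedup(double K, double size, double t) {
    return serial_seconds(K, size) / two_pass_seconds(K, size, t);
}
```

Under this model two classmates give a speedup of 1 (break even) and four classmates give a speedup of 2.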
Now you write up your algorithm in your own notation
int *suma; // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();    // label of classmate
    int nthreads = omp_get_num_threads();  // number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
    // now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i=0; i<size; i++) s += b[i], a[i] = s;
    suma[ithread+1] = s;
    // now wait for each classmate to finish
    #pragma omp barrier
    // now each classmate sums the results of the previous classmates
    int offset = 0;
    for (int i=0; i<(ithread+1); i++) offset += suma[i];
    // now each classmate corrects their result; schedule(static) gives
    // each classmate the same subset as in the first loop
    #pragma omp for schedule(static)
    for (int i=0; i<size; i++) a[i] += offset;
}
free(suma);
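For experimenting, the fragment above can be wrapped in a function. The version below is a sketch, not the exact code above: the wrapper name prefix_sum_omp is hypothetical, the suma array is allocated before the parallel region using omp_get_max_threads() so that an allocation failure can be reported, and a serial stand-in is provided for builds without OpenMP.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch still builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical wrapper: a[i] = b[0] + ... + b[i].
   Returns 0 on success, -1 if the scratch array cannot be allocated. */
int prefix_sum_omp(const int *b, int *a, int size) {
    assert(size >= 0);
    /* One slot per possible classmate plus a leading zero; calloc clears unused slots. */
    int *suma = calloc((size_t)omp_get_max_threads() + 1, sizeof *suma);
    if (suma == NULL) return -1;
    #pragma omp parallel
    {
        int ithread = omp_get_thread_num();
        int s = 0;
        /* Pass 1: running total inside this classmate's subset. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < size; i++) { s += b[i]; a[i] = s; }
        suma[ithread + 1] = s;
        #pragma omp barrier
        /* Add up the totals of all earlier classmates. */
        int offset = 0;
        for (int i = 0; i < ithread + 1; i++) offset += suma[i];
        /* Pass 2: same static schedule, so the same subset is corrected. */
        #pragma omp for schedule(static)
        for (int i = 0; i < size; i++) a[i] += offset;
    }
    free(suma);
    return 0;
}
```

With GCC or Clang this would typically be compiled with the -fopenmp flag; without it the pragmas are ignored and the function runs as a single classmate.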
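You realize that you could optimize the part where each classmate has to add up the results of the previous classmates, but since size >> t you don't think it is worth the effort.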
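One possible form of that optimization, shown only as a sketch: a single classmate turns the array of totals into an array of offsets in place, so that afterwards each classmate reads one value instead of adding up to t values. The name totals_to_offsets is hypothetical. In the parallel code it would presumably run inside a `#pragma omp single` after the barrier, with each classmate then using suma[ithread] as its offset.

```c
#include <assert.h>

/* Turn per-classmate totals into per-classmate offsets, in place.
   On entry: totals[0] = 0 and totals[k+1] = total of classmate k.
   On exit:  totals[k] = combined total of classmates 0 .. k-1,
             and totals[nthreads] = grand total. */
void totals_to_offsets(int *totals, int nthreads) {
    assert(nthreads >= 1);
    for (int k = 1; k <= nthreads; k++) totals[k] += totals[k - 1];
}
```

This reduces the combined work of the offset step from roughly t*t/2 additions to t additions, which only matters when t is not tiny compared to size.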
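Your solution is not nearly as fast as your solution for adding the counting numbers, but your teacher is still unhappy that several of his students finished much earlier than the others. So now he decides that one student has to read the b array slowly to the class, and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?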
The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.