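Since this is homework and you probably want to understand the vectorization process, I am not providing source code that compiles as-is (you should do some coding yourself after reading this answer). Hopefully you can work it out on your own.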
//The loop trip count should be a multiple of the Vectorization Factor (VF).
//In this case VF=4 (assuming your processor has 128-bit SIMD registers and the data are 32-bit).
//1757×4 = 7028 --> 2 values are left over that do not fill a vector; handle them separately or pad the arrays to fit the vector width.
for (i = 0; i < 7028; i+=4) {
a[7031 * i + 703] = b[i] * c[i];
a[7031 * (i+1) + 703] = b[i+1] * c[i+1];
a[7031 * (i+2) + 703] = b[i+2] * c[i+2];
a[7031 * (i+3) + 703] = b[i+3] * c[i+3];
}
//remainder: after the loop i == 7028, so i = 7028 and i = 7029 are handled as scalars
a[7031 * i + 703] = b[i] * c[i];
i++;
a[7031 * i + 703] = b[i] * c[i];
//vec_b = (b[i], b[i+1], b[i+2], b[i+3]); // are adjacent -> thus can be loaded
//vec_c = (c[i], c[i+1], c[i+2], c[i+3]); // are adjacent -> thus can be loaded
//index = 7031*i + 703
//vec_a = (a[index], a[index + 7031], a[index + 7031*2], a[index + 7031*3]); //not adjacent!
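vec_b = _mm_loadu_ps(&b[i]); loads a vector from adjacent elements, and vec_c can be loaded the same way with the same intrinsic. The key point, however, is that the results have to be stored to non-contiguous addresses. If your processor supports AVX-512, you may be able to use a scatter instruction to store a vector to non-contiguous addresses.
If you do not have a scatter instruction, you may need to extract the elements one by one and write each to its own destination address, using _mm_extract_epi32 (SSE4.1), or _mm_cvtss_f32 combined with shifts or shuffles, and so on.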
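The extract-and-store fallback described above could look roughly like the following sketch. It assumes plain SSE only and generalizes the first loop with hypothetical parameters `n`, `stride` and `offset` (7030, 7031 and 703 in the question); the helper name `mul_strided_store` is made up for illustration. Lanes are moved into position 0 with `_mm_shuffle_ps` and read out with `_mm_cvtss_f32`.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_shuffle_ps, _mm_cvtss_f32 */

/* Hypothetical helper: a[stride * i + offset] = b[i] * c[i] for i in [0, n). */
static void mul_strided_store(float *a, const float *b, const float *c,
                              size_t n, size_t stride, size_t offset)
{
    assert(stride > 0);
    const size_t n_vec = n - n % 4; /* largest multiple of VF = 4 not above n */
    size_t i = 0;

    for (; i < n_vec; i += 4) {
        /* b and c are contiguous, so unaligned vector loads work directly. */
        __m128 vec_b = _mm_loadu_ps(&b[i]);
        __m128 vec_c = _mm_loadu_ps(&c[i]);
        __m128 vec_a = _mm_mul_ps(vec_b, vec_c);

        /* The destinations are stride elements apart: extract each lane
           and store it as a scalar (a manual scatter). */
        const size_t index = stride * i + offset;
        a[index] = _mm_cvtss_f32(vec_a);
        a[index + stride] =
            _mm_cvtss_f32(_mm_shuffle_ps(vec_a, vec_a, _MM_SHUFFLE(1, 1, 1, 1)));
        a[index + 2 * stride] =
            _mm_cvtss_f32(_mm_shuffle_ps(vec_a, vec_a, _MM_SHUFFLE(2, 2, 2, 2)));
        a[index + 3 * stride] =
            _mm_cvtss_f32(_mm_shuffle_ps(vec_a, vec_a, _MM_SHUFFLE(3, 3, 3, 3)));
    }

    /* Scalar remainder for the n % 4 leftover elements. */
    for (; i < n; i++) {
        a[stride * i + offset] = b[i] * c[i];
    }
}
```

For the loop in the question the call would be `mul_strided_store(a, b, c, 7030, 7031, 703)`. Only the multiplication is done in a vector here; the four scalar stores may well dominate the cost, so it is worth measuring against the plain scalar loop.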
for (i = 0; i < 7030; i++) {
d[i] = a[7031 * i + 703 * 7030] + e;
}
This loop needs to be vectorized as well, and again you need to understand where the data are located:
for (i = 0; i < 7028; i+=4) {
Index = 7031 * i + 703 * 7030; //recomputed for every group of 4 elements
d[i] = a[Index] + e;
d[i+1] = a[Index + 7031] + e;
d[i+2] = a[Index + 7031*2] + e;
d[i+3] = a[Index + 7031*3] + e;
}
//extra computations for i = 7028, 7029;
//vec_a = (a[Index], a[Index + 7031], a[Index + 7031*2], a[Index + 7031*3])
//vec_a can be built with _mm_set_ps(a3, a2, a1, a0), etc., but a `gather` instruction is also useful for loading from different addresses.
//vec_e = (e, e, e, e) : you can use _mm_set_ps1, _mm_set1...
Finally, how do you multiply or add? Simply use the vector operations:
vec_a = _mm_mul_ps(vec_b, vec_c);
vec_d = _mm_add_ps(vec_a, vec_e);
And how do you store a vector to contiguous locations?
_mm_store_ps(&d[i], vec_d); //needs a 16-byte-aligned address, otherwise use _mm_storeu_ps; i=i+4 for the next store, i.e. your loop counter must step by VF.
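Putting the pieces for the second loop together, a minimal sketch might look like this. It assumes plain SSE, builds vec_a from four scalar loads with `_mm_set_ps` (no gather instruction), and uses an unaligned store so that `d` needs no particular alignment. The helper name `strided_load_add` and the parameters `n`, `stride` and `offset` are hypothetical; in the question they would be 7030, 7031 and 703 * 7030.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_set_ps, _mm_set1_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: d[i] = a[stride * i + offset] + e for i in [0, n). */
static void strided_load_add(float *d, const float *a, float e,
                             size_t n, size_t stride, size_t offset)
{
    assert(stride > 0);
    const __m128 vec_e = _mm_set1_ps(e); /* (e, e, e, e) */
    const size_t n_vec = n - n % 4;      /* largest multiple of VF = 4 not above n */
    size_t i = 0;

    for (; i < n_vec; i += 4) {
        const size_t index = stride * i + offset;

        /* _mm_set_ps takes its arguments from the highest lane to the
           lowest, so the element for d[i] goes last. */
        __m128 vec_a = _mm_set_ps(a[index + 3 * stride],
                                  a[index + 2 * stride],
                                  a[index + stride],
                                  a[index]);
        __m128 vec_d = _mm_add_ps(vec_a, vec_e);

        /* d is contiguous, so one vector store writes four results. */
        _mm_storeu_ps(&d[i], vec_d);
    }

    /* Scalar remainder for the n % 4 leftover elements. */
    for (; i < n; i++) {
        d[i] = a[stride * i + offset] + e;
    }
}
```

If `d` is known to be 16-byte aligned, `_mm_store_ps` can replace `_mm_storeu_ps` in the loop.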
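So, to vectorize the loops you can either use intrinsics (explicit vectorization) or rely on implicit vectorization by the compiler, for example by compiling with gcc/clang at the -O3 optimization level or by enabling the appropriate flags: gcc -ftree-vectorize -ftree-slp-vectorize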
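For the implicit route, the loop is left as ordinary C and the compiler decides. The sketch below is one way to write the second loop so that the auto-vectorizer has a fair chance: the `restrict` qualifiers are an assumption that `d` and `a` never overlap, and the helper name and parameters are hypothetical. Whether the strided load is actually vectorized depends on the target and the compiler's cost model; `-fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (clang) report what was done.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: d[i] = a[stride * i + offset] + e for i in [0, n).
   Possible compile lines:
     gcc   -O3 -fopt-info-vec        -c file.c
     clang -O3 -Rpass=loop-vectorize -c file.c
   restrict promises the compiler that d and a do not alias (an assumption
   the caller must guarantee). */
void strided_add_plain(float *restrict d, const float *restrict a, float e,
                       size_t n, size_t stride, size_t offset)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; i++) {
        d[i] = a[stride * i + offset] + e;
    }
}
```

No remainder handling is needed in the source here; when the compiler vectorizes the loop it generates its own scalar epilogue for the leftover iterations.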