【Posted】: 2014-06-15 02:44:30
【Question】:
My task is to compute the XOR sum of the bytes in an array:
X = char1 XOR char2 XOR char3 ... charN;
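I am trying to parallelize it by XORing __m128 values instead, which should give a speedup factor of 16 (one __m128 holds 16 bytes). In addition, to double-check the algorithm I also run it on int, which should give a speedup factor of 4. The test program is 100 lines long; I could not make it shorter, but it is simple: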
#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>
#include <iostream>
using namespace std;
#include <stdlib.h> // rand
const int NIter = 100;
const int N = 40000000; // array size in bytes; must be divisible by 16 (the width of __m128)
unsigned char str[N] __attribute__ ((aligned(16)));
template< typename T >
T Sum(const T* data, const int N)
{
T sum = 0;
for ( int i = 0; i < N; ++i )
sum = sum ^ data[i];
return sum;
}
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum = _mm_set_ps1(0);
for ( int i = 0; i < N; ++i )
sum = _mm_xor_ps(sum,data[i]);
return sum;
}
int main() {
// fill string by random values
for( int i = 0; i < N; i++ ) {
str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
}
/// -- CALCULATE --
/// SCALAR
unsigned char sumS = 0;
std::clock_t c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ )
sumS = Sum<unsigned char>( str, N );
double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SIMD
unsigned char sumV = 0;
const int m128CharLen = 4*4;
const int NV = N/m128CharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
__m128 sumVV = _mm_set_ps1(0);
sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);
sumV = sumVS[0];
for ( int iE = 1; iE < m128CharLen; ++iE )
sumV ^= sumVS[iE];
}
double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SCALAR INTEGER
unsigned char sumI = 0;
const int intCharLen = 4;
const int NI = N/intCharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);
sumI = sumIS[0];
for ( int iE = 1; iE < intCharLen; ++iE )
sumI ^= sumIS[iE];
}
double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// -- OUTPUT --
cout << "Time scalar: " << tScal << " ms " << endl;
cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;
if(sumV == sumS && sumI == sumS )
std::cout << "Results are the same." << std::endl;
else
std::cout << "ERROR! Results are not the same." << std::endl;
return 0;
}
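The INT and SIMD branches above both rely on XOR being associative and commutative: the bytes can be XORed in wide chunks first, and the wide accumulator can then be folded down to one byte. Below is a minimal, portable sketch of that idea with 64-bit words. The names `xor_bytes` and `xor_bytes_wide` are hypothetical and not part of the test program, and `std::memcpy` is used in place of `reinterpret_cast` only to sidestep alignment and aliasing concerns.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference version: XOR every byte, one at a time.
std::uint8_t xor_bytes(const unsigned char* data, std::size_t n) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc ^= data[i];
    return acc;
}

// Word-wise version: XOR 8 bytes per step, then fold the 64-bit
// accumulator down to a single byte. Any n is accepted; leftover
// bytes are handled by a scalar tail loop.
std::uint8_t xor_bytes_wide(const unsigned char* data, std::size_t n) {
    std::uint64_t acc = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, sizeof w);  // safe unaligned load
        acc ^= w;
    }
    // Fold 64 -> 32 -> 16 -> 8 bits; only the low byte matters at the end.
    acc ^= acc >> 32;
    acc ^= acc >> 16;
    acc ^= acc >> 8;
    std::uint8_t result = static_cast<std::uint8_t>(acc);
    for (; i < n; ++i) result ^= data[i];  // tail bytes
    return result;
}
```

The shift-based fold does the same job as the byte loops over `sumVS` and `sumIS` in the test program, using three XORs instead of one per byte.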
Typical results:
[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.
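As you can see, the int version performs ideally, but the SIMD version loses about 25% of the expected speed (about 12.6x instead of 16x), and this is stable. I tried changing the array size; it did not help.
Also, if I switch to -O2, I lose 75% of the speed in the SIMD version: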
[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.
Can anyone explain this?
Additional information:
I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860
-
I use -fno-tree-vectorize to prevent auto-vectorization. Without this flag, at -O3, the expected speedup is 1, because the task is simple. This is what I get:
[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms
Time INT: 270 ms, speed up 1
Time SIMD: 280 ms, speed up 0.964286
Results are the same.
But the result with -O2 is still strange:
[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms
Time INT: 990 ms, speed up 3.57576
Time SIMD: 880 ms, speed up 4.02273
Results are the same.
-
When I change
for ( int i = 0; i < N; i+=1 ) sum = sum ^ data[i];
to the equivalent
for ( int i = 0; i < N; i+=8 ) sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
I do see the scalar speed improve by a factor of 2, but I do not see any improvement in the speedups. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.
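The unrolled loop above shortens the chain of XORs that depend on the single running `sum`. A related variant, shown here only as a sketch, keeps several independent accumulators and combines them once at the end. The name `SumUnrolled` and the choice of four accumulators are assumptions for illustration, not taken from the post, and nothing here claims to reproduce the timings quoted above.

```cpp
#include <cstddef>

// Hypothetical helper (not from the post): XOR-reduce an array using four
// independent accumulators, so consecutive XORs do not all wait on one
// running value. Works for any integer type T and any length n.
template <typename T>
T SumUnrolled(const T* data, std::size_t n) {
    T a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 ^= data[i];
        a1 ^= data[i + 1];
        a2 ^= data[i + 2];
        a3 ^= data[i + 3];
    }
    // Combine the partial results once, outside the hot loop.
    T sum = static_cast<T>((a0 ^ a1) ^ (a2 ^ a3));
    // Tail: elements left over when n is not a multiple of 4.
    for (; i < n; ++i) sum ^= data[i];
    return sum;
}
```

Unlike the `i+=8` loop in the question, which reads past the end unless the element count is a multiple of 8, this version handles the remainder explicitly.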
【Discussion】:
-
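Can you turn on the -vec-report3 flag and check whether the loops actually get vectorized?
-
@arunmoezhi, what do you mean? Which loops are supposed to be vectorized? My gcc does not recognize -vec-report3.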
-
The scalar version. Why doesn't the compiler optimize it?
-
@arunmoezhi, because of the -fno-tree-vectorize flag.
-
Try _mm_load_si128?
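For reference, the suggestion above could look roughly like the following sketch, which uses the SSE2 integer intrinsics (`__m128i`, `_mm_load_si128`, `_mm_xor_si128`) in place of the float-typed `__m128` and `_mm_xor_ps` from the question. The function name `XorSumSse2` is hypothetical, and the code assumes a 16-byte-aligned buffer whose length is a multiple of 16, as in the test program. Whether it changes the timings on the asker's machine is not something this sketch establishes.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_load_si128, _mm_xor_si128
#include <cstddef>
#include <cstdint>

// Hypothetical helper: XOR sum of n bytes using SSE2 integer intrinsics.
// Assumptions: data is 16-byte aligned and n is a multiple of 16.
std::uint8_t XorSumSse2(const unsigned char* data, std::size_t n) {
    const __m128i* p = reinterpret_cast<const __m128i*>(data);
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < n / 16; ++i)
        acc = _mm_xor_si128(acc, _mm_load_si128(p + i));  // aligned 16-byte load
    // Horizontal fold inside the register: 128 -> 64 -> 32 bits.
    acc = _mm_xor_si128(acc, _mm_srli_si128(acc, 8));
    acc = _mm_xor_si128(acc, _mm_srli_si128(acc, 4));
    // Finish in scalar code: 32 -> 16 -> 8 bits.
    std::uint32_t w = static_cast<std::uint32_t>(_mm_cvtsi128_si32(acc));
    w ^= w >> 16;
    w ^= w >> 8;
    return static_cast<std::uint8_t>(w);
}
```

With the question's buffer it would be called as `XorSumSse2(str, N)`; the result can be compared against `Sum<unsigned char>(str, N)` the same way `sumV` is compared against `sumS`.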
Tags: c++ performance parallel-processing simd seeding