【Question title】: Memory alignment for SSE in C++, _aligned_malloc equivalent?
【Posted】: 2014-04-20 14:33:07
【Question】:

I would like to know how to convert this piece of C code to C++ for memory alignment.

float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);

I did look here, and then I tried this: float *pResult = (float*) __attribute__((aligned(16)));

and also this:

float *pResult = __attribute__((aligned(16)));

but both give similar errors:

error: expected primary-expression before '__attribute__'|
error: expected ',' or ';' before '__attribute__'|

Full code:

#include "stdafx.h"
#include <xmmintrin.h>  // Need this for SSE compiler intrinsics
#include <math.h>       // Needed for sqrt in CPU-only version
#include <stdio.h>

int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");

    const int length = 64000;

    // We will be calculating Y = sqrt(x) / x, for x = 1->64000

    // If you do not properly align your data for SSE instructions, you may take a huge performance hit.
    float *pResult = (float*) __attribute__((aligned(16))); // align to 16-byte for SSE
    __m128 x;
    __m128 xDelta = _mm_set1_ps(4.0f);      // Set the xDelta to (4,4,4,4)
    __m128 *pResultSSE = (__m128*) pResult;


    const int SSELength = length / 4;

    for (int stress = 0; stress < 100000; stress++) // lots of stress loops so we can easily use a stopwatch
    {
#define TIME_SSE    // Define this if you want to run with SSE
#ifdef TIME_SSE
        x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // Set the initial values of x to (4,3,2,1)
        for (int i=0; i < SSELength; i++)
        {
            __m128 xSqrt = _mm_sqrt_ps(x);
            // Note! Division is slow. It's actually faster to take the reciprocal of a number and multiply
            // Also note that Division is more accurate than taking the reciprocal and multiplying

#define USE_DIVISION_METHOD
#ifdef USE_FAST_METHOD
            __m128 xRecip = _mm_rcp_ps(x);
            pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);
#endif //USE_FAST_METHOD
#ifdef USE_DIVISION_METHOD
            pResultSSE[i] = _mm_div_ps(xSqrt, x);
#endif  // USE_DIVISION_METHOD

            // NOTE! The element order in SSE may seem reversed: _mm_set_ps takes its
            // arguments from the highest element down to the lowest, so _mm_set_ps(4,3,2,1)
            // stores 1,2,3,4 in memory order. That is why the initial x vector was set to
            // (4,3,2,1) instead of (1,2,3,4).

            x = _mm_add_ps(x, xDelta);  // Advance x to the next set of numbers
        }
#endif  // TIME_SSE
#ifndef TIME_SSE
        float xFloat = 1.0f;
        for (int i=0 ; i < length; i++)
        {
            pResult[i] = sqrt(xFloat) / xFloat; // Even though division is slow, there are no intrinsic functions like there are in SSE
            xFloat += 1.0f;
        }
#endif  // !TIME_SSE
    }

    // To prove that the program actually worked
    for (int i=0; i < 20; i++)
    {
        printf("Result[%d] = %f\n", i, pResult[i]);
    }

    // Results for my particular system
    // 23.75 seconds for SSE with reciprocal/multiplication method
    // 38.5 seconds for SSE with division method
    // 301.5 seconds for CPU

    return 0;
}
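The `__attribute__` line in the code above is not valid C++: an attribute can only qualify a declaration, which is why GCC reports `expected primary-expression`, and even on a declaration it would not allocate any memory for `pResult`. A minimal sketch of one fix, assuming GCC or Clang on x86 with SSE enabled (the default on x86-64; use `-msse` on 32-bit), is `_mm_malloc`/`_mm_free`, which GCC and Clang declare through `<xmmintrin.h>` and which MSVC also provides. The helper names `allocate_results` and `compute_sqrt_over_x` are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; GCC/Clang also declare _mm_malloc/_mm_free here
#include <cassert>
#include <cstdint>

// Hypothetical helper: replaces the invalid __attribute__ line with a real
// 16-byte aligned allocation. Returns nullptr on failure.
float *allocate_results(int length)
{
    return static_cast<float *>(_mm_malloc(length * sizeof(float), 16));
}

// Hypothetical helper: fills out[0..length) with sqrt(x) / x for x = 1, 2, ..., length.
// Requires out to be 16-byte aligned and length to be a multiple of 4.
void compute_sqrt_over_x(float *out, int length)
{
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    assert(length % 4 == 0);

    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // lanes in memory order: 1, 2, 3, 4
    const __m128 xDelta = _mm_set1_ps(4.0f);
    for (int i = 0; i < length / 4; i++)
    {
        // _mm_store_ps is an aligned store: it requires out + 4*i to be 16-byte aligned
        _mm_store_ps(out + 4 * i, _mm_div_ps(_mm_sqrt_ps(x), x));
        x = _mm_add_ps(x, xDelta);
    }
}
```

In the question's `main`, `float *pResult = allocate_results(length);` would replace the `__attribute__` line, and `_mm_free(pResult);` would go before `return 0;`. Memory from `_mm_malloc` should be released with `_mm_free`, not `free`.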

【Question comments】:

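  • You can use _aligned_malloc from C or C++, but note that it is Microsoft-specific.
  • I'm using GCC, but with MinGW?
  • In that case, use the more portable memalign/posix_memalign family of calls; there are already several related questions on StackOverflow with good answers.
  • I thought those were for C?
  • Sure, but that doesn't stop you from using them in C++.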
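A minimal sketch of the suggestion in these comments, assuming `posix_memalign` (from `<stdlib.h>`) on POSIX systems and, on Windows, `_aligned_malloc`/`_aligned_free` from `<malloc.h>`, which are assumed to be available under MinGW as well as MSVC. The `_WIN32` branch exists because `posix_memalign` is a POSIX function and may be missing on Windows toolchains. The wrapper names `portable_aligned_malloc` and `portable_aligned_free` are hypothetical.

```cpp
#include <cstdlib>   // posix_memalign / free on POSIX systems
#include <cstddef>
#include <cstdint>
#include <cassert>
#if defined(_WIN32)
#include <malloc.h>  // _aligned_malloc / _aligned_free (assumed for MSVC and MinGW)
#endif

// Hypothetical wrapper: returns `size` bytes aligned to `alignment`, or nullptr.
// `alignment` must be a power of two and a multiple of sizeof(void*).
void *portable_aligned_malloc(std::size_t size, std::size_t alignment)
{
    assert(alignment % sizeof(void *) == 0 && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void *p = nullptr;
    // posix_memalign returns 0 on success and an error code (not errno) otherwise
    if (posix_memalign(&p, alignment, size) != 0)
        return nullptr;
    return p;
#endif
}

// Memory must be released with the function matching the allocator used above.
void portable_aligned_free(void *p)
{
#if defined(_WIN32)
    _aligned_free(p);
#else
    std::free(p);
#endif
}
```

For the question's code this would be `float *pResult = static_cast<float *>(portable_aligned_malloc(length * sizeof(float), 16));`, paired with `portable_aligned_free(pResult);`.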

Tags: c++ g++ malloc sse memory-alignment


【Solution 1】:

With C++11, you can use something like this:

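#include <vector>

struct aligned_float
{
    alignas(16) float f[4];  // one SSE vector's worth of floats, 16-byte aligned
};

static_assert(sizeof(aligned_float) == 4 * sizeof(float), "padding issue");

int main()
{
    const int length = 64000;  // number of floats
    // Each element holds 4 floats, so length / 4 elements are needed.
    std::vector<aligned_float> pResult(length / (sizeof(aligned_float) / sizeof(float)));

    return 0;
}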
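A sketch of how the `aligned_float` elements might then be filled with SSE, assuming the vector's storage is 16-byte aligned (guaranteed for over-aligned types since C++17, and checked here with `assert`). `_mm_store_ps` writes straight into each element's `f` array, so no `__m128*` cast is needed. The function name `fill_sqrt_over_x` is hypothetical.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cstdint>
#include <vector>

struct aligned_float
{
    alignas(16) float f[4];
};

static_assert(alignof(aligned_float) == 16, "each element starts on a 16-byte boundary");
static_assert(sizeof(aligned_float) == 4 * sizeof(float), "padding issue");

// Hypothetical helper: element k receives sqrt(x) / x for x = 4k+1 .. 4k+4.
void fill_sqrt_over_x(std::vector<aligned_float> &v)
{
    // Only guaranteed for over-aligned types since C++17, so check at run time.
    assert(reinterpret_cast<std::uintptr_t>(v.data()) % 16 == 0);

    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // memory order: 1, 2, 3, 4
    const __m128 xDelta = _mm_set1_ps(4.0f);
    for (aligned_float &e : v)
    {
        _mm_store_ps(e.f, _mm_div_ps(_mm_sqrt_ps(x), x));  // aligned store into e.f
        x = _mm_add_ps(x, xDelta);
    }
}
```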

【Comments】:

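  • std::vector<T> and new do not actually honor over-aligned alignment requirements (at least not in C++11 or C++14; C++17 changed this by adding an aligned operator new). It happens to work on most x86-64 systems because the allocator already returns 16B-aligned memory by default.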
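For C++11/14, one way around the limitation described in this comment is a custom allocator. The sketch below is a minimal one (the name `AlignedAllocator` is hypothetical, not a standard type) that obtains storage from `_mm_malloc`, assumed available as in GCC, Clang and MSVC, so `std::vector` no longer depends on `operator new` honoring over-alignment.

```cpp
#include <xmmintrin.h>   // _mm_malloc / _mm_free
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical minimal C++11 allocator returning Alignment-byte aligned storage.
template <typename T, std::size_t Alignment>
struct AlignedAllocator
{
    typedef T value_type;

    AlignedAllocator() noexcept {}
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment> &) noexcept {}

    // Spelled out because the default rebind cannot handle a non-type template parameter.
    template <typename U>
    struct rebind { typedef AlignedAllocator<U, Alignment> other; };

    T *allocate(std::size_t n)
    {
        void *p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p)
            throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % Alignment == 0);
        return static_cast<T *>(p);
    }

    void deallocate(T *p, std::size_t) noexcept { _mm_free(p); }
};

// Stateless allocators: any two instances can free each other's memory.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A> &, const AlignedAllocator<U, A> &) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A> &, const AlignedAllocator<U, A> &) noexcept { return false; }

// A float vector whose data() is 16-byte aligned, usable with _mm_load_ps/_mm_store_ps.
typedef std::vector<float, AlignedAllocator<float, 16>> aligned_float_vector;
```

With it, `aligned_float_vector pResult(length);` gives a plain float vector whose `data()` can be passed to aligned SSE loads and stores, even after reallocation.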

【Solution 2】:

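The aligned attribute only affects how things are compiled and linked. It has no run-time effect, so it cannot make memory obtained from an allocation aligned.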
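To illustrate the distinction: `__attribute__((aligned(16)))`, or the standard C++11 `alignas(16)`, is valid on a declaration, where it controls where the compiler and linker (or the stack frame) place that object. It cannot be applied to a pointer value or to memory obtained at run time. A minimal sketch, assuming GCC or Clang for the `__attribute__` spelling; all names are hypothetical, and this only helps when the size is known at compile time.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Static storage: the compiler/linker place these arrays on 16-byte boundaries.
float g_results_gnu[64000] __attribute__((aligned(16)));  // GCC/Clang extension
alignas(16) float g_results_std[64000];                   // standard C++11 spelling

// Type-level alignment: every Vec4 object (and array element) is 16-byte aligned.
struct Vec4
{
    alignas(16) float v[4];
};
static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");

// Automatic storage: the compiler aligns the array within the stack frame.
float sum_local_aligned()
{
    alignas(16) float local[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    assert(is_aligned16(local));
    return local[0] + local[1] + local[2] + local[3];
}
```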

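The only portable way I know of to solve this is to use a wrapper that allocates more memory than necessary and masks off the low bits of the address, so that the pointer it returns satisfies the required alignment.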
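A minimal sketch of such a wrapper, using only `malloc`/`free`. The names are hypothetical; it over-allocates by `alignment - 1` bytes plus room for one pointer, rounds the address up by clearing the low bits, and stores the original `malloc` pointer just before the aligned block so the matching free function can recover it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical wrapper: returns `size` bytes aligned to `alignment` (a power of two),
// or nullptr on failure. Must be released with aligned_free_fallback, never free().
void *aligned_malloc_fallback(std::size_t size, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // Keep the slot holding the original pointer itself pointer-aligned.
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);

    const std::size_t extra = alignment - 1 + sizeof(void *);
    if (size > static_cast<std::size_t>(-1) - extra)
        return nullptr;  // size + extra would overflow

    void *raw = std::malloc(size + extra);
    if (!raw)
        return nullptr;

    // Skip past the pointer slot, then round up by masking off the low bits.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    std::uintptr_t aligned = (addr + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);

    // Remember where the malloc block really starts, just before the aligned address.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

// Frees memory obtained from aligned_malloc_fallback; accepts nullptr like free().
void aligned_free_fallback(void *p)
{
    if (p)
        std::free(reinterpret_cast<void **>(p)[-1]);
}
```

The rounding step is the "mask off the low bits" from the answer: adding `alignment - 1` and clearing the low `log2(alignment)` bits yields the first suitably aligned address inside the over-sized block.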

【Comments】:

  • I think I only need it for GCC. The simplest answer will do.
