使用 AVX 最快实现指数函数答案

【问题标题】：Fastest Implementation of Exponential Function Using AVX使用 AVX 最快实现指数函数
【发布时间】：2018-07-29 13:02:55
【问题描述】：

我正在寻找在 AVX 元素（单精度浮点）上运行的指数函数的有效（快速）近似值。即 - __m256 _mm256_exp_ps( __m256 x ) 没有 SVML。

相对准确度应该类似于 ~1e-6，或 ~20 尾数位（2^20 中的 1 部分）。

如果它是用 C 风格和英特尔内在函数编写的，我会很高兴。
代码应该是可移植的（Windows、macOS、Linux、MSVC、ICC、GCC 等...）。

这与Fastest Implementation of Exponential Function Using SSE 类似，但该问题的查找速度非常快且精度低（当前答案的精度约为 1e-3）。

另外，这个问题正在寻找 AVX / AVX2（和 FMA）。但请注意，这两个问题的答案很容易在 SSE4 __m128 或 AVX2 __m256 之间移植，因此未来的读者应该根据所需的精度/性能权衡来选择。

【问题讨论】：

vml 应该没问题：bitbucket.org/eschnett/vecmathlib/wiki/Home
看看avx_mathfun的AVX2优化exp函数。
@Royi 为什么你不能移动你的 SEE 和 AVX 函数来分离源文件并用-msse2 编译一个，用-mavx 编译另一个？
@Zboson 请注意，Agner Fog 声称 exp 在 vectormath_exp.h 中的性能很差，请参阅文档的 page 43。 avx_mathfun 的一个优点是它使用类似于 Chebyshev 近似的多项式而不是 VCL 使用的泰勒展开式。所以avx_mathfun应该比VCL在性能和精度之间有更好的平衡。
@wim，显然 GCC 仅将 exp 向量化为 double 而不是 float。奇怪godbolt.org/g/mN14F7

标签： x86 simd avx exponential avx2

【解决方案1】：

avx_mathfun 中的 exp 函数使用范围缩减与 Chebyshev 近似多项式相结合，与 AVX 指令并行计算 8 exp-s。尽可能使用正确的编译器设置来确保 addps 和 mulps 与 FMA 指令融合。

将 avx_mathfun 中的原始 exp 代码改编为可移植（跨不同编译器）C / AVX2 内在代码非常简单。原始代码使用 gcc 样式对齐属性和巧妙的宏。改用标准_mm256_set1_ps() 的修改代码位于小测试代码和表格下方。修改后的代码需要AVX2。

以下代码用于简单测试：

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f ,4.0f ,5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_store_ps(xv,x);
    _mm256_store_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}

输出好像没问题：

i = 0, x = 1.000000e+00, y = 2.718282e+00 
i = 1, x = 2.000000e+00, y = 7.389056e+00 
i = 2, x = 3.000000e+00, y = 2.008554e+01 
i = 3, x = 4.000000e+00, y = 5.459815e+01 
i = 4, x = 5.000000e+00, y = 1.484132e+02 
i = 5, x = 6.000000e+00, y = 4.034288e+02 
i = 6, x = 7.000000e+00, y = 1.096633e+03 
i = 7, x = 8.000000e+00, y = 2.980958e+03

修改后的代码（AVX2）为：

#include <stdio.h>
#include <immintrin.h>
/*     gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c    */

__m256 exp256_ps(__m256 x) {
/* Modified code. The original code is here: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)
*/
/* 
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc. The modified code requires AVX2
*/
__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341);
__m256   cephes_exp_C1 = _mm256_set1_ps(0.693359375);
__m256   cephes_exp_C2 = _mm256_set1_ps(-2.12194440e-4);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   tmp           = _mm256_setzero_ps(), fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx    = _mm256_mul_ps(x, cephes_LOG2EF);
        fx    = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));
        tmp   = _mm256_floor_ps(fx);
__m256  mask  = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS);    
        mask  = _mm256_and_ps(mask, one);
        fx    = _mm256_sub_ps(tmp, mask);
        tmp   = _mm256_mul_ps(fx, cephes_exp_C1);
__m256  z     = _mm256_mul_ps(fx, cephes_exp_C2);
        x     = _mm256_sub_ps(x, tmp);
        x     = _mm256_sub_ps(x, z);
        z     = _mm256_mul_ps(x,x);

__m256  y     = cephes_exp_p0;
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p1);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p2);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p3);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p4);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p5);
        y     = _mm256_mul_ps(y, z);
        y     = _mm256_add_ps(y, x);
        y     = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0  = _mm256_cvttps_epi32(fx);
        imm0  = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0  = _mm256_slli_epi32(imm0, 23);
__m256  pow2n = _mm256_castsi256_ps(imm0);
        y     = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f ,4.0f ,5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_store_ps(xv,x);
    _mm256_store_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}

作为@Peter Cordes points out, 应该可以将_mm256_floor_ps(fx + 0.5f) 替换为 _mm256_round_ps(fx)。此外，mask = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS); 和接下来的两行似乎是多余的。通过将cephes_exp_C1 和cephes_exp_C2 组合成inv_LOG2EF，可以进一步优化。这导致下面的代码没有经过彻底的测试！

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
/*    gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c -lm     */

__m256 exp256_ps(__m256 x) {
/* Modified code from this source: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)

*/
/* 
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(11.0f, -12.0f, 13.0f ,-14.0f ,15.0f, -16.0f, 17.0f, -18.0f);
    __m256 y = exp256_ps(x);
    _mm256_store_ps(xv,x);
    _mm256_store_ps(yv,y);

 /* compare exp256_ps with the double precision exp from math.h, 
    print the relative error             */
    printf("i      x                     y = exp256_ps(x)      double precision exp        relative error\n\n");
    for (i=0;i<8;i++){ 
        printf("i = %i  x =%16.9e   y =%16.9e   exp_dbl =%16.9e   rel_err =%16.9e\n",
           i,xv[i],yv[i],exp((double)(xv[i])),
           ((double)(yv[i])-exp((double)(xv[i])))/exp((double)(xv[i])) );
    }
    return 0;
}

下表通过将 exp256_ps 与来自 math.h 的双精度 exp 进行比较，给出了某些点的准确度印象。相对误差在最后一列。

i      x                     y = exp256_ps(x)      double precision exp        relative error

i = 0  x = 1.000000000e+00   y = 2.718281746e+00   exp_dbl = 2.718281828e+00   rel_err =-3.036785947e-08
i = 1  x =-2.000000000e+00   y = 1.353352815e-01   exp_dbl = 1.353352832e-01   rel_err =-1.289636419e-08
i = 2  x = 3.000000000e+00   y = 2.008553696e+01   exp_dbl = 2.008553692e+01   rel_err = 1.672817689e-09
i = 3  x =-4.000000000e+00   y = 1.831563935e-02   exp_dbl = 1.831563889e-02   rel_err = 2.501162103e-08
i = 4  x = 5.000000000e+00   y = 1.484131622e+02   exp_dbl = 1.484131591e+02   rel_err = 2.108215155e-08
i = 5  x =-6.000000000e+00   y = 2.478752285e-03   exp_dbl = 2.478752177e-03   rel_err = 4.380257261e-08
i = 6  x = 7.000000000e+00   y = 1.096633179e+03   exp_dbl = 1.096633158e+03   rel_err = 1.849522682e-08
i = 7  x =-8.000000000e+00   y = 3.354626242e-04   exp_dbl = 3.354626279e-04   rel_err =-1.101575118e-08

【讨论】：

为什么他们使用floor(fx+0.5) 而不是_mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC)？总是向-Inf而不是通常的最近偶数舍入中途情况真的很重要吗？或者介绍weirdness for inputs like 1 ulp below 0.5。
也许写这篇文章的人不知道vroundps 的存在，并且正在移植使用floor 的标量代码。然后他们比较floor(fx+0.5) > (fx+0.5) 并有条件地从舍入结果中减去1.0？这是针对糟糕的+0.5首先引入的舍入错误的某种修复吗？实际上我看不出比较是如何正确的，不是floor( f )严格意义上的<= f吗？
无论如何，OP 想要一些快速的东西，相对误差高达 1e-6 是可以接受的，这远远不需要所有 23 个有效位正确，所以应该放弃超精确的技巧（如果那是什么他们是）。除非在指数步长之间的截止点周围存在危险区域或其他东西。
@PeterCordes 我明白了。实际上0.693359375 有一个浮点表示，末尾有 15 个零位。因此，将0.693359375 与一个小整数相乘应该是准确的，这种大/小拆分确实有助于提高准确性。
那么大概就是这样了。当然，如果 FMA 可用，则您根本不需要临时四舍五入。 :)

【解决方案2】：

由于exp() 的快速计算需要处理 IEEE-754 浮点操作数的指数字段，AVX 并不真正适合这种计算，因为它缺少整数运算。因此，我将专注于AVX2。对融合乘法加法的支持在技术上是独立于AVX2 的功能，因此我提供了两个代码路径，使用和不使用 FMA，由宏 USE_FMA 控制。

下面的代码计算exp() 到几乎所需的精度 10^-6。在这里使用 FMA 并没有提供任何显着的改进，但它应该在支持它的平台上提供性能优势。

之前answer 中用于低精度 SSE 实现的算法不能完全扩展到相当精确的实现，因为它包含一些数值属性较差的计算，但是在这种情况下并不重要。而不是计算 e^x = 2ⁱ * 2^f，f in [0,1] 或 f in [- ½, ½]，用f 在更窄的区间 [-½log 2, ½log 2]，其中log 表示自然对数。

为此，我们首先计算i = rint(x * log2(e))，然后计算f = x - log(2) * i。 重要的是，后一种计算需要使用高于的原生精度来传递准确的缩减参数以传递给核心近似值。为此，我们使用 Cody-Waite 方案，该方案首次发表于 WJ Cody & W. Waite，“基本功能软件手册”，Prentice Hall 1980。常数 log(2) 被分成较大的“高”部分幅度和一个幅度小得多的“低”部分，它保持“高”部分和数学常数之间的差异。

选择尾数中具有足够尾随零位的高位部分，以便i 与“高位”部分的乘积完全可以以原生精度表示。这里我选择了一个带有八个尾随零位的“高”部分，因为i 肯定适合八位。

本质上，我们计算 f = x - i * log(2)_high - i * log(2)_low。这个简化的参数被传递到核心近似值中，它是一个多项式minimax approximation，并且结果按上一个答案中的 2ⁱ 缩放。

#include <immintrin.h>

#define USE_FMA 0

/* compute exp(x) for x in [-87.33654f, 88.72283] 
   maximum relative error: 3.1575e-6 (USE_FMA = 0); 3.1533e-6 (USE_FMA = 1)
*/
__m256 faster_more_accurate_exp_avx2 (__m256 x)
{
    __m256 t, f, p, r;
    __m256i i, j;

    const __m256 l2e = _mm256_set1_ps (1.442695041f); /* log2(e) */
    const __m256 l2h = _mm256_set1_ps (-6.93145752e-1f); /* -log(2)_hi */
    const __m256 l2l = _mm256_set1_ps (-1.42860677e-6f); /* -log(2)_lo */
    /* coefficients for core approximation to exp() in [-log(2)/2, log(2)/2] */
    const __m256 c0 =  _mm256_set1_ps (0.041944388f);
    const __m256 c1 =  _mm256_set1_ps (0.168006673f);
    const __m256 c2 =  _mm256_set1_ps (0.499999940f);
    const __m256 c3 =  _mm256_set1_ps (0.999956906f);
    const __m256 c4 =  _mm256_set1_ps (0.999999642f);

    /* exp(x) = 2^i * e^f; i = rint (log2(e) * x), f = x - log(2) * i */
    t = _mm256_mul_ps (x, l2e);      /* t = log2(e) * x */
    r = _mm256_round_ps (t, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); /* r = rint (t) */

#if USE_FMA
    f = _mm256_fmadd_ps (r, l2h, x); /* x - log(2)_hi * r */
    f = _mm256_fmadd_ps (r, l2l, f); /* f = x - log(2)_hi * r - log(2)_lo * r */
#else // USE_FMA
    p = _mm256_mul_ps (r, l2h);      /* log(2)_hi * r */
    f = _mm256_add_ps (x, p);        /* x - log(2)_hi * r */
    p = _mm256_mul_ps (r, l2l);      /* log(2)_lo * r */
    f = _mm256_add_ps (f, p);        /* f = x - log(2)_hi * r - log(2)_lo * r */
#endif // USE_FMA

    i = _mm256_cvtps_epi32(t);       /* i = (int)rint(t) */

    /* p ~= exp (f), -log(2)/2 <= f <= log(2)/2 */
    p = c0;                          /* c0 */
#if USE_FMA
    p = _mm256_fmadd_ps (p, f, c1);  /* c0*f+c1 */
    p = _mm256_fmadd_ps (p, f, c2);  /* (c0*f+c1)*f+c2 */
    p = _mm256_fmadd_ps (p, f, c3);  /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_fmadd_ps (p, f, c4);  /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#else // USE_FMA
    p = _mm256_mul_ps (p, f);        /* c0*f */
    p = _mm256_add_ps (p, c1);       /* c0*f+c1 */
    p = _mm256_mul_ps (p, f);        /* (c0*f+c1)*f */
    p = _mm256_add_ps (p, c2);       /* (c0*f+c1)*f+c2 */
    p = _mm256_mul_ps (p, f);        /* ((c0*f+c1)*f+c2)*f */
    p = _mm256_add_ps (p, c3);       /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_mul_ps (p, f);        /* (((c0*f+c1)*f+c2)*f+c3)*f */
    p = _mm256_add_ps (p, c4);       /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#endif // USE_FMA

    /* exp(x) = 2^i * p */
    j = _mm256_slli_epi32 (i, 23); /* i << 23 */
    r = _mm256_castsi256_ps (_mm256_add_epi32 (j, _mm256_castps_si256 (p))); /* r = p * 2^i */

    return r;
}

如果需要更高的精度，可以使用以下一组系数将多项式逼近的次数提高一倍：

/* maximum relative error: 1.7428e-7 (USE_FMA = 0); 1.6586e-7 (USE_FMA = 1) */
const __m256 c0 =  _mm256_set1_ps (0.008301110f);
const __m256 c1 =  _mm256_set1_ps (0.041906696f);
const __m256 c2 =  _mm256_set1_ps (0.166674897f);
const __m256 c3 =  _mm256_set1_ps (0.499990642f);
const __m256 c4 =  _mm256_set1_ps (0.999999762f);
const __m256 c5 =  _mm256_set1_ps (1.000000000f);

【讨论】：

对于这样一个低次多项式来说，这是一个相当准确的近似值！通过对my answer 中的代码进行的一些实验，我发现当不需要高精度时，Cody-Waite 技巧并不总是必要的，特别是在使用 FMA 计算缩减参数时。仅使用 AVX2（无 FMA），对于 [-84,84] 中的所有浮点数 x，第二个 exp256_ps(x) 函数（在我的答案的底部）的最大相对误差为 4.1e-6。编译器启用 FMA 后，对于 [-84,84] 中的 x，最大相对误差为 3.0e-7。
@wim 我使用了极小极大近似值，而其他代码中使用的近似值可能并非如此。我同意使用 FMA 有时可以使 Cody-Waite 技巧的使用变得不必要。我仍在试验上面代码的 FMA 增强版本。初步迹象表明，这里似乎并没有增加太多的准确性。
i = (int)r 注释与代码不匹配：cvtps_epi32 是舍入到最近的转换，而不是 C 截断为零（即 cvttps_epi32, with an extra t for truncate）。 C 没有任何好的方式来表示舍入并转换为int（lrint 返回一个long），但表示操作的最准确方式是(int) rint(r)。哦，我看到r 是roundps 的结果。通过使用 _mm256_cvtps_epi32(t) 而不是 r 来缩短该 dep 链（但不是关键路径）。
@PeterCordes 我同意你的意见。我不确定 ILP 的小幅增加会导致实际的性能差异，因为这偏离了前面提到的关键路径；我也不确定该更改对端口使用的潜在差异的影响（我几乎不像您那样精通多代 x86 CPU 的复杂细节 :-) 如果性能数据显示使用 _mm256_cvtps_epi32(t)更优秀，我很乐意应用更改。
它为 CPU 提供了提前运行它的选项，有时在处理 FP 加法和舍入指令的同一端口上有空闲周期时。这可能会导致更多或更少的资源冲突（转换从关键路径中窃取一个循环）。嗯，在 Haswell/Skylake 上，vroundps 是 2 微秒（对于 p1/p01），因此转换为整数和返回的延迟和吞吐量与 Haswell+ 上的实际 vroundps 相同。这样做可以免费为您提供四舍五入的整数结果！（而且我认为 Bulldozer / Ryzen 和 Sandybridge 的收支平衡）

【解决方案3】：

我玩了很多次，发现了这个，它的相对精度约为 1-07e，并且很容易转换为矢量指令。只有 4 个常数，5 个乘法和 1 个除法，这是内置 exp() 函数的两倍。

float fast_exp(float x)
{
    const float c1 = 0.007972914726F;
    const float c2 = 0.1385283768F;
    const float c3 = 2.885390043F;
    const float c4 = 1.442695022F;      
    x *= c4; //convert to 2^(x)
    int intPart = (int)x;
    x -= intPart;
    float xx = x * x;
    float a = x + c1 * xx * x;
    float b = c3 + c2 * xx;
    float res = (b + a) / (b - a);
    reinterpret_cast<int &>(res) += intPart << 23; // res *= 2^(intPart)
    return res;
}

转换为 AVX（更新）

__m256 _mm256_exp_ps(__m256 _x)
{
    __m256 c1 = _mm256_set1_ps(0.007972914726F);
    __m256 c2 = _mm256_set1_ps(0.1385283768F);
    __m256 c3 = _mm256_set1_ps(2.885390043F);
    __m256 c4 = _mm256_set1_ps(1.442695022F);
    __m256 x = _mm256_mul_ps(_x, c4); //convert to 2^(x)
    __m256 intPartf = _mm256_round_ps(x, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    x = _mm256_sub_ps(x, intPartf);
    __m256 xx = _mm256_mul_ps(x, x);
    __m256 a = _mm256_add_ps(x, _mm256_mul_ps(c1, _mm256_mul_ps(xx, x))); //can be improved with FMA
    __m256 b = _mm256_add_ps(c3, _mm256_mul_ps(c2, xx));
    __m256 res = _mm256_div_ps(_mm256_add_ps(b, a), _mm256_sub_ps(b, a));
    __m256i intPart = _mm256_cvtps_epi32(intPartf); //res = 2^intPart. Can be improved with AVX2!
    __m128i ii0 = _mm_slli_epi32(_mm256_castsi256_si128(intPart), 23);
    __m128i ii1 = _mm_slli_epi32(_mm256_extractf128_si256(intPart, 1), 23);     
    __m128i res_0 = _mm_add_epi32(ii0, _mm256_castsi256_si128(_mm256_castps_si256(res)));
    __m128i res_1 = _mm_add_epi32(ii1, _mm256_extractf128_si256(_mm256_castps_si256(res), 1));
    return _mm256_insertf128_ps(_mm256_castsi256_ps(_mm256_castsi128_si256(res_0)), _mm_castsi128_ps(res_1), 1);
}

【讨论】：

_mm256_insertf128_ps(_mm256_setzero_ps(), _mm_castsi128_ps(res_0), 0) 是多余的；当您已经要插入一个新的高半部分时，您不需要通过插入一个零向量来对其进行零扩展。刚投。顺便说一句，大多数现代 CPU 也有 AVX2，所以你根本不需要解压到 128 位，除非你需要在 Bulldozer 系列和 SnB/IvB 上运行。 Haswell 及更高版本，以及 Ryzen（甚至是挖掘机 APU）都有 AVX2。或者低功耗的英特尔（Silvermont 系列）甚至没有 AVX，现代奔腾/赛扬芯片也没有。所以是的，周围有一些仅支持 AVX1 的 CPU，但越来越少了。
你是对的，但是这段代码是用于 AVX 指令的。我在代码 cmets 中编写了可以使用 AVX2 和 FMA 改进的代码
我评论的第一部分仍然适用于仅 AVX1。你只需要一个_mm256_insertf128_ps（直接1）。您的最后一行过于复杂。

【解决方案4】：

你可以approximate the exponent yourself with Taylor series:

exp(z) = 1 + z + pow(z,2)/2 + pow(z,3)/6 + pow(z,4)/24 + ...

为此，您只需要 AVX 中的加法和乘法运算。 1/2、1/6、1/24 等系数如果硬编码然后乘以而不是除以更快。

根据您的精度要求，选取尽可能多的序列成员。请注意，您会得到相对错误：对于小的z，绝对值可能是1e-6，但对于大的z，绝对值会大于1e-6，仍然abs(E-E1)/abs(E) - 1 小于@987654329 @（其中E 是精确指数，E1 是近似值）。

更新：正如@Peter Cordes 在评论中提到的那样，可以通过分离整数和小数部分的幂运算来提高精度，通过操纵二进制 float 表示的指数字段来处理整数部分（基于 2 ^x，而不是 e^x)。那么你的泰勒级数只需要在一个小范围内最小化误差。

【讨论】：

在实践中这不会很有效，除非 z 非常接近于零（！）此外，对于 z 的负值，可能会出现严重的准确性问题。高效且准确的 exp 近似值应至少使用某种形式的范围缩减。对于函数逼近，通常使用（有理）切比雪夫逼近而不是泰勒级数，因为它们具有更好的数值特性。
@wim，关于实施有什么想法吗？
@Royi 试试avx_mathfun 看看它是否适用于您的应用程序。令人惊讶的是，它使用了范围缩减和简单的泰勒近似！
不用宏来设置常量，只需使用标准的_mm256_set1_ps()和_mm256_set1_epi32()等。
为了不让泰勒展开式变得糟糕，您应该只将它用于输入的小数部分。取整数部分并将其填充到float 的指数字段中以获得 2^x。（我认为，在某处的额外乘法可以解释 2^x 与 e^x）。使用仅小数部分的多项式，您只需在非常更小的范围内最小化误差。这与您用于 log(x) 的反向技巧相同：提取输入的指数以获得 log2(integer_part(x))。