Fast way to get a close power-of-2 number (floating-point): answers

【Question title】: Fast way to get a close power-of-2 number (floating-point)
【Posted】: 2019-01-21 20:45:27
【Question description】:

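In numerical calculations, it is often necessary to scale numbers into a safe range.

For example, computing the Euclidean distance: sqrt(a^2+b^2). Here, if the magnitude of a or b is too small or too large, underflow or overflow can happen.

A common approach to solve this is to divide the numbers by the largest-magnitude number. However, this solution is:

slow (division is slow)
a source of a little extra inaccuracy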

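So I thought that instead of dividing by the largest-magnitude number, I could multiply by the reciprocal of a nearby power of 2. This seems to be a better solution, because:

multiplication is much faster than division
accuracy is better, since multiplying by a power of 2 is exact

So, I'd like to create a small utility function with logic like this (by ^, I mean exponentiation):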

void getScaler(double value, double &scaler, double &scalerReciprocal) {
    int e = <exponent of value>;   /* pseudocode: unbiased exponent */
    if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }
    else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }
    else { scaler=2^e; scalerReciprocal = 2^(-e); }   /* in biased terms: 2046 - biased exponent */
}

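This function should return a normalized scaler and scalerReciprocal. Both are powers of 2, scaler is close to value, and scalerReciprocal is the reciprocal of scaler.

The allowed exponent range for scaler/scalerReciprocal is -1022..1022 (I don't want to work with a subnormal scaler, because subnormal numbers can be slow).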

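What would be a fast way to do this? Can it be done with pure floating-point operations? Or should I extract the exponent from value and use simple ifs to implement the logic? Is there some trick to make the comparison against (-)1022 fast (since the range is symmetric)?

Note: scaler doesn't need to be the closest possible power of 2. If some logic requires it, scaler can be a small power of 2 away from the closest value.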

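【Question discussion】:

I don't think this answers your question, but if you are looking for performance, the fastest way to multiply and divide by 2 is bit shifting. In signal processing you subtract the offset of the signal, then scale it as you describe so that the range is [0,1], and you design your filters to have a maximum magnitude of 1. Also, I don't understand: you scale only when the exponent is less than -1022 or greater than 1022. Why is that? What should the range be?
@FrancescoBoi: No, scaling always happens. But scaler should only have exponents between -1022..1022 (this is almost the full range; only the problematic border values are eliminated).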
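Are you interested in portable pure C that compiles efficiently for x86, or are you also interested in C with SIMD intrinsics, such as AVX512 _mm512_getexp_pd (extract the exponent as a double) and _mm512_scalef_pd, which does dst[63:0] := tmp_src1[63:0] * POW(2, FLOOR(tmp_src2[63:0])) (i.e. adds the integer part of one double to the exponent field of another)?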
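@PeterCordes: I'm mostly interested in pure C, or maybe widely available extensions. AVX is still not that common. But thanks for mentioning it, I didn't know about _mm512_getexp_pd/_mm512_scalef_pd. They look similar to the old FXTRACT/FSCALE.
@PeterCordes Regarding the ternary operator and minsd, minpd, maxsd, maxpd: somehow clang does a better job than gcc. (Godbolt link)

Tags: c++ x86 floating-point ieee-754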

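【Solution 1】:

The function s = get_scale(z) computes the "close power of 2". Since the fraction bits of s are zero, the inverse of s is just an (inexpensive) integer subtraction: see the function inv_of_scale.

On x86, get_scale and inv_of_scale compile to quite efficient assembly with clang. clang translates the ternary operators to minsd and maxsd; see also Peter Cordes' comment. With gcc, it is slightly more efficient to translate these functions to x86 intrinsics code (get_scale_x86 and inv_of_scale_x86); see Godbolt.

Note that C explicitly permits type-punning through a union, whereas C++ (C++11) has no such permission. Although gcc 8.2 and clang 7.0 don't complain about the union, you can improve C++ portability by using the memcpy trick instead of the union trick. Such a modification of the code should be trivial. The code should handle subnormals correctly.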

#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */

union dbl_int64{
    double d;
    uint64_t i;
};

double get_scale(double t){
    union dbl_int64 x;
    union dbl_int64 x_min;
    union dbl_int64 x_max;
    uint64_t mask_i;
           /* 0xFEDCBA9876543210 */
    x_min.i = 0x0010000000000000ull;
    x_max.i = 0x7FD0000000000000ull;
    mask_i =  0x7FF0000000000000ull;
    x.d = t;
    x.i = x.i & mask_i;                    /* Set fraction bits to zero, take absolute value */
    x.d = (x.d < x_min.d) ? x_min.d : x.d; /* If subnormal: set exponent to 1                */
    x.d = (x.d > x_max.d) ? x_max.d : x.d; /* If exponent is very large: set exponent to 7FD, otherwise the inverse is a subnormal */
    return x.d;
}

double get_scale_x86(double t){
    __m128d x = _mm_set_sd(t);
    __m128d x_min = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
    __m128d x_max = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
    __m128d mask  = _mm_castsi128_pd(_mm_set1_epi64x(0x7FF0000000000000ull));
            x     = _mm_and_pd(x, mask);
            x     = _mm_max_sd(x, x_min);
            x     = _mm_min_sd(x, x_max);
    return _mm_cvtsd_f64(x);
}

/* Compute the inverse 1/t of a double t with all zero fraction bits     */
/* and exponent between the limits of function get_scale                 */
/* A single integer subtraction is much less expensive than a            */
/* floating point division.                                               */
double inv_of_scale(double t){
    union dbl_int64 x;
                     /* 0xFEDCBA9876543210 */
    uint64_t inv_mask = 0x7FE0000000000000ull;
    x.d = t;
    x.i = inv_mask - x.i;
    return x.d;
}

double inv_of_scale_x86(double t){
    __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
    __m128d x        = _mm_set_sd(t);
    __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
    return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}


int main(){
    int n = 14;
    int i;
    /* Several example values, 4.94e-324 is the smallest subnormal */
    double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,  
                     1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};
    double z, s, u;

    printf("Portable code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale(z);
        u = inv_of_scale(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    printf("\nx86 specific SSE code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale_x86(z);
        u = inv_of_scale_x86(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    return 0;
}

The output looks good:

Portable code:
             x       pow_of_2        inverse       pow2*inv      x*inverse 
 4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16
 1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13
 1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00
  1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00
  7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00
  1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00
  1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00
 1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00
 1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00
-1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13
 -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00
 -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00
 -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00
-1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00

x86 specific SSE code:
             x       pow_of_2        inverse       pow2*inv      x*inverse 
 4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16
 1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13
 1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00
  1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00
  7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00
  1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00
  1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00
 1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00
 1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00
-1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13
 -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00
 -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00
 -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00
-1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00
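
The memcpy variant mentioned above is not spelled out in the answer. A possible sketch, assuming the same three constants as get_scale and using hypothetical helper names (bits_of, double_of, get_scale_memcpy, inv_of_scale_memcpy):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-pattern helpers: a fixed-size memcpy is a well-defined type pun in C and C++. */
static uint64_t bits_of(double d)     { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static double   double_of(uint64_t u) { double d;   memcpy(&d, &u, sizeof d); return d; }

/* Same logic as get_scale, without the union. */
double get_scale_memcpy(double t) {
    const double x_min = double_of(0x0010000000000000ull);    /* 2^-1022 */
    const double x_max = double_of(0x7FD0000000000000ull);    /* 2^1022  */
    double x = double_of(bits_of(t) & 0x7FF0000000000000ull); /* keep the exponent field only */
    x = (x < x_min) ? x_min : x;   /* zero and subnormals */
    x = (x > x_max) ? x_max : x;   /* exponent fields 7FE and 7FF */
    return x;
}

/* Same logic as inv_of_scale: t must be a power of two from get_scale_memcpy. */
double inv_of_scale_memcpy(double t) {
    uint64_t b = bits_of(t);
    assert((b & 0x000FFFFFFFFFFFFFull) == 0);   /* fraction bits must be zero */
    return double_of(0x7FE0000000000000ull - b);
}
```

For example, get_scale_memcpy(123.1) gives 64.0 and inv_of_scale_memcpy(64.0) gives 0.015625, matching the table above. Optimizing compilers usually turn such a memcpy into a plain register move, so this should not cost anything compared with the union.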

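Vectorization

The function get_scale should vectorize with compilers that support auto-vectorization. The following piece of code vectorizes very well with clang (no need to write SSE/AVX intrinsics code).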

/* Test how well get_scale vectorizes: */
void get_scale_vec(double * __restrict__ t, double * __restrict__ x){
    int n = 1024;
    int i;
    for (i = 0; i < n; i++){
        x[i] = get_scale(t[i]);
    }
}

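Unfortunately, gcc doesn't find the vmaxpd and vminpd instructions.

【Discussion】:

Thanks for your answer! Based on your solution, I've found a (possibly) faster solution.
Re: union type-punning: GNU C++ explicitly supports it as an extension to ISO C++. See gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type-punning and gcc.gnu.org/onlinedocs/gcc/…. I think MSVC supports it too, but IDK if that's documented. There's no downside to using memcpy with good compilers, though, as long as it copies the full width of the type.
@PeterCordes: As far as I know, MSVC doesn't take advantage of optimizations based on the strict aliasing rules. For example, this code reloads *a twice unnecessarily.
@geza: Strict aliasing is related to, but separate from, union type-punning. *(int*)&my_float is still UB in GNU C, although in practice it often works for simple cases like this one with no mixing of FP operations.
@PeterCordes: Sure :) I meant that with MSVC, all kinds of type punning work, not just the union-based kind, as with gcc/clang. I've never seen this documented anywhere, though; it's just based on my experience.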

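【Solution 2】:

Based on wim's answer, here is another solution, which can be faster because it uses fewer instructions. The output is a little different, but it still fulfills the requirements.

The idea is to use bit operations to fix the border cases: put 01 into the two least significant bits of the exponent, no matter what its value is. So, for the exponent:

0 becomes 1 (-1023 becomes -1022)
2046 becomes 2045 (1023 becomes 1022)
other exponents are modified too, but only slightly: compared to wim's solution, the number can become twice as large (when the low exponent bits go from 00 to 01), half as large (10->01), or a quarter as large (11->01)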
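
The mapping listed above can be checked on the 11-bit biased exponent field alone. The following is a small illustrative sketch (the function names are hypothetical), using the exponent-field parts 0x7FD and 0x001 of the two masks:

```c
#include <assert.h>

/* Apply the AND/OR masks to a biased exponent: low two bits become 01. */
unsigned force_low_exp_bits_01(unsigned biased_exp) {
    assert(biased_exp <= 0x7FFu);               /* the field is 11 bits wide */
    return (biased_exp & 0x7FDu) | 0x001u;
}

/* Change of the exponent, i.e. log2 of (new scale / power of two below |value|). */
int exp_shift(unsigned biased_exp) {
    return (int)force_low_exp_bits_01(biased_exp) - (int)biased_exp;
}
```

Here force_low_exp_bits_01(0) is 1 and force_low_exp_bits_01(2046) is 2045, while exp_shift returns +1, 0, -1 and -2 for exponents ending in 00, 01, 10 and 11 respectively. The field value 2047 (infinities and NaNs) also maps to 2045.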

So, this modified routine works (and I think it's pretty cool that the problem can be solved with only 2 fast asm instructions):

#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */

union dbl_int64{
    double d;
    uint64_t i;
};

double get_scale(double t){
    union dbl_int64 x;
    uint64_t and_i;
    uint64_t or_i;
         /* 0xFEDCBA9876543210 */
    and_i = 0x7FD0000000000000ull;
    or_i =  0x0010000000000000ull;
    x.d = t;
    x.i = (x.i & and_i)|or_i;                     /* Clear sign and fraction bits, force the two lowest exponent bits to 01 */
    return x.d;
}

double get_scale_x86(double t){
    __m128d x = _mm_set_sd(t);
    __m128d x_and = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
    __m128d x_or  = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
            x     = _mm_and_pd(x, x_and);
            x     = _mm_or_pd(x, x_or);
    return _mm_cvtsd_f64(x);
}

/* Compute the inverse 1/t of a double t with all zero fraction bits     */
/* and exponent between the limits of function get_scale                 */
/* A single integer subtraction is much less expensive than a            */
/* floating point division.                                               */
double inv_of_scale(double t){
    union dbl_int64 x;
                     /* 0xFEDCBA9876543210 */
    uint64_t inv_mask = 0x7FE0000000000000ull;
    x.d = t;
    x.i = inv_mask - x.i;
    return x.d;
}

double inv_of_scale_x86(double t){
    __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
    __m128d x        = _mm_set_sd(t);
    __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
    return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}


int main(){
    int n = 14;
    int i;
    /* Several example values, 4.94e-324 is the smallest subnormal */
    double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,  
                     1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};
    double z, s, u;

    printf("Portable code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale(z);
        u = inv_of_scale(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    printf("\nx86 specific SSE code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale_x86(z);
        u = inv_of_scale_x86(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    return 0;
}

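【Discussion】:

Which 3 instructions are you counting? Yours reduces it to 2, just ANDPS / ORPS instead of ANDPS/MINPD/MAXPD. Or, if you're counting actually scaling 2 values based on the largest magnitude, then you need AND + AND + (select the highest exponent with MAXPD or AVX512 VPMAXUQ) + OR + PSUBQ, and then apply it to both inputs with 2x MULPD, or VFMADD... if you or the compiler can contract the first step after normalization into an FMA.
BTW, avoid the scalar intrinsics; they suck because there's no way to tell the compiler you want a vector with undefined upper elements, i.e. there's no scalar->128 equivalent of __m256 _mm256_castps128_ps256 (__m128 a). Unfortunately, most compilers other than clang really do waste an instruction on zero-extending for _mm_set_sd(t). See "How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?". Just use the union type-pun version, which I think all the major x86 compilers su
Oh, I just looked at your Godbolt link. It's weird that compilers fail to use ANDPS / ORPS for the scalar version even in a loop, and actually extract to GP regs with movq. There are no call-preserved XMM registers in x86-64 System V, so they can't hoist the constants, but using them from memory would still be a win. Hopefully compilers can auto-vectorize the pure C version, though.
You might want to add a version of the function that takes 2 inputs and returns a single scale factor to use for both, because, like I said, you can combine finding the max magnitude with zeroing the significand.
@PeterCordes: With the AND/OR idea, normal values are scaled into the range [0.5...8.0), which is really perfect for reliably computing the hypotenuse.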

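【Solution 3】:

You can use

double frexp (double x, int* exp);

The return value is the normalized fraction of x, with magnitude in [0.5, 1), and *exp receives the unbiased exponent, so that x equals the return value times 2^(*exp).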
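
As a rough illustration of how frexp could be combined with ldexp here, the sketch below returns the largest power of two not exceeding |x|. The name pow2_floor_frexp is hypothetical, and the clamping to -1022..1022 that the question asks for is deliberately left out:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: largest power of two <= |x|, for finite nonzero x. */
double pow2_floor_frexp(double x) {
    int e;
    assert(isfinite(x) && x != 0.0);   /* frexp leaves *exp unspecified for inf/NaN */
    (void)frexp(x, &e);                /* x = m * 2^e with 0.5 <= |m| < 1 */
    return ldexp(1.0, e - 1);          /* 2^(e-1) <= |x| < 2^e */
}
```

For instance, pow2_floor_frexp(123.1) is 64.0 and pow2_floor_frexp(0.7) is 0.5.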

Alternatively, the following code extracts the exponent part of a double.

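#include <stdint.h>
#include <string.h>
/* Unbiased exponent of *d (meaningful for normal numbers only). */
int get_exp(const double *d) {
  uint64_t bits;
  memcpy(&bits, d, sizeof bits);               /* well-defined type pun, no aliasing violation */
  return (int)((bits >> 52) & 0x7FF) - 1023;   /* 11-bit exponent field minus the bias */
}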

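【Discussion】:

frexp does more than is needed here. And it does less, because I need to clamp exp, and then I need to convert back to get a double. I don't think frexp is really useful in my case, as I need speed. I'd rather extract the exponent manually, if that's the way to go (it's just a memcpy, a shift, and a mask).
@geza: frexp is just a shift and a mask, standardized and documented to make it portable. If you want to adjust the exponent, pair it with ldexp (note that ldexp adds to the exponent rather than replacing it).
@BenVoigt: Unfortunately, no. Check out the source code. It handles nan/inf and subnormal numbers. It does some logic that might duplicate part of my if (exp<-1022) .. logic. So with frexp I'd have redundant code. I'm not saying it's slow. But if that's the case, it's better (for me) to extract the exponent manually.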