推导平方根近似的误差范围答案

【问题标题】：Deriving error bounds for approximation of square root推导平方根近似的误差范围
【发布时间】：2016-06-19 18:23:16
【问题描述】：

在this answer 的一个关于四元数归一化的问题中，作者提供了一些计算平方根的代码，使用2.0 / (1.0 + qmagsq) 作为1.0 / std::sqrt(qmagsq) 的近似值，对于非常接近1 的值：

double qmagsq = quat.square_magnitude();
if (std::abs(1.0 - qmagsq) < 2.107342e-08) {
    quat.scale (2.0 / (1.0 + qmagsq));
} else {
    quat.scale (1.0 / std::sqrt(qmagsq));
}

然后作者给出如下解释：

对于介于 0 和 2 之间的 qmagsq 值，此近似值的误差小于 (1-qmagsq)^2 / 8。幻数 2.107342e-08 表示此错误在 IEEE 双倍 ULP 的一半以上。

大概是因为sqrt(8 * 2^-(1+52) / 2) 大约是2.10734243e-8，其中2^-(1+52) / 2 是double 精度的一半。

对于介于 0 和 2 之间的 qmagsq 的值，如何得出 (1-qmagsq)^2 / 8 作为该近似值的误差上限？

编辑：

已经指出作者does not actually hold为qmagsq的值在0和1之间提供的错误界限。结果问题变得更加开放了：

如何推导出此近似值的误差界限，该界限可用于确定该近似值的误差小于 IEEE 双精度 ULP 一半的范围？

【问题讨论】：

我不这么认为。该范围必须包含小于 1 的值，因为近似值用于qmagsq 小于 1 的值。
我的意思是：对于qmagsq = 0.1，我们有2.0 / (1.0 + qmagsq) ~= 1.81818，1.0 / sqrt (qmagsq) ~= 3.16228，但((1-qmagsq)**2)/8 = 0.10125。公式不成立。
我明白你的意思——对于从 0 到 1 的所有 qmagsq 值，错误界限似乎都无效——我已经相应地更新了问题。
@njuffa 但是该近似值并未用于整个区间 (0,2]，它仅用于一个分支，即 (1-e,1+e)，其中 e 是魔法number 2.107342e-08 所以必须在每个分支上针对区间 (0,1-e] (1-e,1+e) 和 [1+e,2] 进行错误分析
@aka.nice 正在一个分支上针对另一个分支执行错误分析。错误分析的目标是确定 e.

标签： floating-point floating-accuracy

【解决方案1】：

计算机的速度已经足够快，可以在整个输入域中详尽地测试单参数函数的特定断言以实现单精度，并以相当小的间隔测试双精度。实际界限通常约为 2⁴⁸ 个测试向量左右。我假设使用 IEEE-754 兼容平台，默认舍入模式为舍入到最近或偶数，并且所有代码都是在编译器可以召集的最严格遵守 IEEE-754 的情况下构建的（对于我的英特尔编译器，例如这是/fp:strict）。

问题中的主张是，快速替换在统一附近实现了 0.5 ulp 或更小的误差。换句话说，使用 IEEE-754 舍入模式将结果正确舍入到最近或偶数。有两种测试该断言的方法：使用正确舍入的 rsqrt() 实现作为参考，并以 1 ulp 步骤迭代参数直到发现不匹配，或者使用多精度库作为参考，然后停止当快速替代的 ulp 误差超过 0.5 ulps 时。在后一种情况下，我们需要比双精度精度高一倍多一点的参考结果，以避免双舍入效应。对于倒数平方根，一个 2n+3 位的引用就足够了：

Cristina Iordache 和 David W. Matula：“论除法、平方根、倒数和平方根倒数的无限精确舍入”。在第 14 届 IEEE 计算机算术研讨会论文集上，澳大利亚阿德莱德，1999 年 4 月 14-16 日，第 233-240 页

下面的 ISO-C99 代码使用第一种方法。它从单位开始搜索，然后在零方向或二方向搜索，在第一个不匹配处停止。输出如下：

arg = (1.0 + 2.2204460492503131e-016)  quick_rsqrt =  0x1.0000000000000p+0 (1.0000000000000000e+000)  rsqrt_rn =  0x1.fffffffffffffp-1 (9.9999999999999989e-001)   
arg = (1.0 - 1.2166747276332046e-008)  quick_rsqrt =  0x1.0000001a20bd7p+0 (1.0000000060833736e+000)  rsqrt_rn =  0x1.0000001a20bd8p+0 (1.0000000060833738e+000)

我也尝试了第二种方法，并得到了匹配的结果。快速替换在（1.0 + 2.2204460492503131e-16）处的误差为0.9995 ulps，在（1.0 - 1.2166747276332046e-8）处的误差为0.5002 ulps。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* function under test */
double quick_rsqrt (double a)
{
    return 2.0 / (1.0 + a);
}

/* starting approximation for reciprocal square root */
double simple_rsqrt (double a)
{
    return 1.0 / sqrt (a);
}

/* most significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t hi_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)(t >> 32);
}

/* least significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t lo_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)t;
}

/* construct IEEE-754 binary64 from upper and lower half of its bit representation */
double mk_double_from_hilo_uint32 (uint32_t hi, uint32_t lo)
{
    double r;
    uint64_t t = ((uint64_t)hi << 32)  + ((uint64_t)lo);
    memcpy (&r, &t, sizeof r);
    return r;
}

/* reciprocal square root, rounded to-nearest-or-even */
double rsqrt_rn (double a)
{
    double y, h, l, e;
    uint32_t alo, ahi, temp;
    int32_t diff;
    
    ahi = hi_uint32_of_double (a);
    alo = lo_uint32_of_double (a);
    if ((ahi - 0x00100000u) < 0x7fe00000u) { // positive normals
        /* scale argument towards unity */
        temp = (ahi & 0x3fffffff) | 0x3fe00000;
        diff = temp - ahi; // exponent difference
        a = mk_double_from_hilo_uint32 (temp, alo); 
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        diff = diff >> 1; // adjust exponent; ensure arithmetic right shift which is not guaranteed by ISO-C99
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + diff, lo_uint32_of_double (y));
    } else if (a == 0.0) { // zeros
        a = mk_double_from_hilo_uint32 ((ahi & 0x80000000) | 0x7ff00000, 0x00000000);
    } else if (a < 0.0) { // negatives
        a = mk_double_from_hilo_uint32 (0xfff80000, 0x00000000);
    } else if (isinf (a)) { // infinities
        a = mk_double_from_hilo_uint32 (ahi & 0x80000000, 0x00000000);
    } else if (isnan (a)) { // NaNs
        a = a + a;
    } else { // positive subnormals
        /* scale argument towards unity */
        a = a * mk_double_from_hilo_uint32 (0x7fd00000, 0);
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + 0x1ff00000, lo_uint32_of_double (y));
    }
    return a;
}

int main (void)
{
    double x, ref, res;

    /* Try arguments greater than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 + %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    x - 1.0, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 2.0);
    } while (x < 2.0);

    /* Try arguments less than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 - %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    1.0 - x, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 0.0);
    } while (x > 0.0);

    return EXIT_SUCCESS;
}

【讨论】：