两个乘积的和和差的精确浮点计算答案

【问题标题】：Accurate floating-point computation of the sum and difference of two products两个乘积的和和差的精确浮点计算
【发布时间】：2020-12-19 05:32:39
【问题描述】：

两个乘积的差和两个乘积的和是在各种常见计算中发现的两个原语。 diff_of_products (a,b,c,d) := ab - cd 和 sum_of_products(a,b,c,d) := ab + cd 是密切相关的伴随函数，它们的区别仅在于它们的一些操作数的符号。这些原语的使用示例如下：

用 x = (a + i b) 和 y = (c + i d) 计算复数乘法：

x*y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)

2x2矩阵行列式的计算：diff_of_products (a, d, b, c)：

| a  b |
| c  d |

在直角三角形computation of the length of the opposite cathesus 中，从斜边h 和相邻的导管a：diff_of_products (h, h, a, a)

用正判别式计算两个实数solutions of a quadratic equation：

q = -(b + copysign (sqrt (diff_of_products (b, b, 4a, c)), b)) / 2
x₀ = q / a
x₁ = c / q

Computation of a 3D cross producta = b ⨯ c:

a_x = diff_of_products (b_y, c_z, b_z, c_y)
a_y = diff_of_products (b_z, c_x, b_x, c_z)
a_z = diff_of_products (b_x, c_y, b_y, c_x)

当使用 IEEE-754 二进制浮点格式进行计算时，除了明显的潜在上溢和下溢问题外，当两个乘积的幅度相似但 sum_of_products() 的符号相反时，任一函数的幼稚实现都可能遭受灾难性的抵消或 diff_of_products() 的符号相同。

只关注精度方面，如何在 IEEE-754 二进制算术的上下文中稳健地实现这些功能？可以假设融合乘加操作的可用性，因为大多数现代处理器架构都支持此操作，并且通过标准函数在许多编程语言中公开。在不失一般性的情况下，可以将讨论限制为单精度 (IEEE-754 binary32) 格式，以便于阐述和测试。

【问题讨论】：

标签： algorithm floating-point floating-accuracy

【解决方案1】：

融合乘加 (FMA) 操作在提供减法消除保护方面的实用性源于完整的双倍宽度乘积参与最终加法。据我所知，它在准确、稳健地计算二次方程解方面的实用性的第一个公开记录是著名浮点专家威廉·卡汉 (William Kahan) 的两组非正式笔记：

William Kahan，“Matlab 的损失是没有人的收获”。 1998 年 8 月，2004 年 7 月修订 (online)
威廉·卡汉（William Kahan），“关于没有超精确算术的浮点计算的成本”。 2004 年 11 月 (online)

Higham 的数值计算标准工作是我第一次遇到 Kahan 的算法，该算法应用于计算 2x2 矩阵的行列式（第 65 页）：

Nicholas J. Higham，“数值算法的准确性和稳定性”，SIAM 1996

三位英特尔研究人员在英特尔首款支持 FMA 的 CPU 安腾处理器 (p. 273) 的背景下发布了另一种计算 ab+cd 的算法，同样基于 FMA：

Marius Cornea、John Harrison 和 Ping Tak Peter Tang：“基于安腾系统的科学计算”。英特尔出版社 2002

近年来，法国研究人员的四篇论文详细研究了这两种算法，并提供了带有数学证明的误差范围。对于二进制浮点运算，如果中间计算中没有上溢或下溢，Kahan 算法和 Cornea-Harrison-Tang (CHT) 算法的最大相对误差均显示为两倍单位渐近四舍五入，即 2u。对于 IEEE-754 binary32 或单精度，此误差范围为 2^-23，对于 IEEE-754 binary64 或双精度，此误差范围为 2^-52。

此外，对于二进制浮点算术，Kahan 算法的误差最多为 1.5 ulps。从文献中我不知道 CHT 算法的等效结果，即已证明的 ulp 误差界。我自己使用下面的代码进行的实验建议错误界限为 1.25 ulp。

Sylvie Boldo，“Kahan 的正确判别计算算法终于被正式证明”， IEEE 计算机交易，卷。 58，第 2 期，2009 年 2 月，第 220-225 页 (online)

Claude-Pierre Jeannerod、Nicolas Louvet 和 Jean-Michel Muller，“进一步分析 Kahan 的 2x2 行列式精确计算算法”，计算数学，卷。 82，284，2013年10月，第2245-2264页（online）

Jean-Michel Muller，“关于使用 Cornea、Harrison 和 Tang 方法计算 ab+cd 的错误”，ACM Transactions on Mathematical Software，卷。 41号2015年1月第7条(online)

Claude-Pierre Jeannerod，“Cornea-Harrison-Tang 方法的基数独立误差分析”，ACM Transactions on Mathematical Software Vol。 2016年5月42号第3期第19条(online)

Kahan 的算法需要四个浮点运算，其中两个是 FMA，而 CHT 算法需要七个浮点运算，其中两个是 FMA。我构建了下面的测试框架来探索可能存在的其他权衡。我通过实验证实了文献中关于两种算法的相对误差和 Kahan 算法的 ulp 误差的界限。我的实验表明，CHT 算法提供的 ulp 误差范围更小，为 1.25 ulp，但它也以大约两倍于 Kahan 算法的速率产生不正确舍入的结果。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

#define TEST_SUM  (0)  // function under test. 0: a*b-c*d; 1: a*b+c*d 
#define USE_CHT   (0)  // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang

/*
  Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
    float w = d * c;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, -w);
    return f + e;
}

/*
  Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b; 
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1); 
    float e2 = fmaf (c, -d, p2);
    float r = p1 - p2; 
    float e = e1 + e2;
    return r + e;
}

/*
  Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
  Harrison and Tang's Method", ACM Transactions on Mathematical Software, 
  Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, w);
    return f - e;
}

/*
  Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b; 
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1); 
    float e2 = fmaf (c, d, -p2);
    float r = p1 + p2; 
    float e = e1 + e2;
    return r + e;
}

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

typedef struct {
    double y;
    double x;
} dbldbl;

dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.x = tail;
    z.y = head;
    return z;
}

dbldbl add_dbldbl (dbldbl a, dbldbl b) {
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y + b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) + (b.y - t2);
    t4 = a.x + b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) + (b.x - t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y - b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) - (b.y + t2);
    t4 = a.x - b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) - (b.x + t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl t, z;
    t.y = a.y * b.y;
    t.x = fma (a.y, b.y, -t.y);
    t.x = fma (a.x, b.x, t.x);
    t.x = fma (a.y, b.x, t.x);
    t.x = fma (a.x, b.y, t.x);
    z.y = t.y + t.x;
    z.x = (t.y - z.y) + t.x;
    return z;
}

double prod_diff_ref (float a, float b, float c, float d)
{
    dbldbl t = sub_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

double prod_sum_ref (float a, float b, float c, float d)
{
    dbldbl t = add_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)__float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
       is < 2^-126, we need to denormalize because of float's limited exponent
       range.
    */
    expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
            0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    } else {
        j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void)
{
    const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
    const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
    const uint64_t N = 1ULL << 38;
    double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
    uint64_t count = 0LL, incorrectly_rounded = 0LL;
    uint32_t ai, bi, ci, di;
    float af, bf, cf, df, resf;

#if TEST_SUM
    printf ("testing a*b+c*d ");
#else
    printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
    printf ("using Cornea-Harrison-Tang algorithm\n");
#else
    printf ("using Kahan algorithm\n");
#endif

    do {
        do {
            ai = KISS;
            af = __uint32_as_float (ai);
        } while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
        do {
            bi = KISS;
            bf = __uint32_as_float (bi);
        } while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
        do {
            ci = KISS;
            cf = __uint32_as_float (ci);
        } while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
        do {
            di = KISS;
            df = __uint32_as_float (di);
        } while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
        count++;
#if TEST_SUM        
#if USE_CHT
        resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
        resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
        ulp = floatUlpErr (resf, ref);
        incorrectly_rounded += ulp > 0.5;
        relerr = fabs ((resf - ref) / ref);
        if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
            maxulp = ulp;
            maxrelerr = relerr;
            printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
                    count, incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
        }
    } while (count <= N);

    return EXIT_SUCCESS;
}

【讨论】：