融合乘加 (FMA) 操作在提供减法消除保护方面的实用性源于完整的双倍宽度乘积参与最终加法。据我所知,它在准确、稳健地计算二次方程解方面的实用性的第一个公开记录是著名浮点专家威廉·卡汉 (William Kahan) 的两组非正式笔记:
William Kahan,“Matlab 的损失是没有人的收获”。 1998 年 8 月,2004 年 7 月修订 (online)
威廉·卡汉(William Kahan),“关于没有超精确算术的浮点计算的成本”。 2004 年 11 月 (online)
Higham 的数值计算标准工作是我第一次遇到 Kahan 的算法,该算法应用于计算 2x2 矩阵的行列式(第 65 页):
Nicholas J. Higham,“数值算法的准确性和稳定性”,SIAM 1996
三位英特尔研究人员在英特尔首款支持 FMA 的 CPU 安腾处理器 (p. 273) 的背景下发布了另一种计算 ab+cd 的算法,同样基于 FMA:
Marius Cornea、John Harrison 和 Ping Tak Peter Tang:“基于安腾系统的科学计算”。英特尔出版社 2002
近年来,法国研究人员的四篇论文详细研究了这两种算法,并提供了带有数学证明的误差范围。对于二进制浮点运算,如果中间计算中没有上溢或下溢,Kahan 算法和 Cornea-Harrison-Tang (CHT) 算法的最大相对误差均显示为两倍单位渐近四舍五入,即 2u。对于 IEEE-754 binary32 或单精度,此误差范围为 2-23,对于 IEEE-754 binary64 或双精度,此误差范围为 2-52。
此外,对于二进制浮点算术,Kahan 算法的误差最多为 1.5 ulps。从文献中我不知道 CHT 算法的等效结果,即已证明的 ulp 误差界。我自己使用下面的代码进行的实验建议错误界限为 1.25 ulp。
Sylvie Boldo,“Kahan 的正确判别计算算法终于被正式证明”,
IEEE 计算机交易,卷。 58,第 2 期,2009 年 2 月,第 220-225 页 (online)
Claude-Pierre Jeannerod、Nicolas Louvet 和 Jean-Michel Muller,“进一步分析 Kahan 的 2x2 行列式精确计算算法”,计算数学,卷。 82,284,2013年10月,第2245-2264页(online)
Jean-Michel Muller,“关于使用 Cornea、Harrison 和 Tang 方法计算 ab+cd 的错误”,ACM Transactions on Mathematical Software,卷。 41号2015年1月第7条(online)
Claude-Pierre Jeannerod,“Cornea-Harrison-Tang 方法的基数独立误差分析”,ACM Transactions on Mathematical Software Vol。 2016年5月42号第3期第19条(online)
Kahan 的算法需要四个浮点运算,其中两个是 FMA,而 CHT 算法需要七个浮点运算,其中两个是 FMA。我构建了下面的测试框架来探索可能存在的其他权衡。我通过实验证实了文献中关于两种算法的相对误差和 Kahan 算法的 ulp 误差的界限。我的实验表明,CHT 算法提供的 ulp 误差范围更小,为 1.25 ulp,但它也以大约两倍于 Kahan 算法的速率产生不正确舍入的结果。
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>
#define TEST_SUM (0) // function under test. 0: a*b-c*d; 1: a*b+c*d
#define USE_CHT (0) // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang
/*
Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
"Further Analysis of Kahan's Algorithm for the Accurate Computation
of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284,
Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
float w = d * c;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, -w);
return f + e;
}
/*
Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, -d, p2);
float r = p1 - p2;
float e = e1 + e2;
return r + e;
}
/*
Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
Harrison and Tang's Method", ACM Transactions on Mathematical Software,
Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
float w = c * d;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, w);
return f - e;
}
/*
Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, d, -p2);
float r = p1 + p2;
float e = e1 + e2;
return r + e;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
typedef struct {
double y;
double x;
} dbldbl;
dbldbl make_dbldbl (double head, double tail)
{
dbldbl z;
z.x = tail;
z.y = head;
return z;
}
dbldbl add_dbldbl (dbldbl a, dbldbl b) {
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y + b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) + (b.y - t2);
t4 = a.x + b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) + (b.x - t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y - b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) - (b.y + t2);
t4 = a.x - b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) - (b.x + t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
dbldbl t, z;
t.y = a.y * b.y;
t.x = fma (a.y, b.y, -t.y);
t.x = fma (a.x, b.x, t.x);
t.x = fma (a.y, b.x, t.x);
t.x = fma (a.x, b.y, t.x);
z.y = t.y + t.x;
z.x = (t.y - z.y) + t.x;
return z;
}
double prod_diff_ref (float a, float b, float c, float d)
{
dbldbl t = sub_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
double prod_sum_ref (float a, float b, float c, float d)
{
dbldbl t = add_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
static double floatUlpErr (float res, double ref)
{
uint64_t i, j, err;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of float's limited exponent
range.
*/
expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
} else {
j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
}
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
const uint64_t N = 1ULL << 38;
double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
uint64_t count = 0LL, incorrectly_rounded = 0LL;
uint32_t ai, bi, ci, di;
float af, bf, cf, df, resf;
#if TEST_SUM
printf ("testing a*b+c*d ");
#else
printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
printf ("using Cornea-Harrison-Tang algorithm\n");
#else
printf ("using Kahan algorithm\n");
#endif
do {
do {
ai = KISS;
af = __uint32_as_float (ai);
} while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
do {
bi = KISS;
bf = __uint32_as_float (bi);
} while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
do {
ci = KISS;
cf = __uint32_as_float (ci);
} while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
do {
di = KISS;
df = __uint32_as_float (di);
} while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
count++;
#if TEST_SUM
#if USE_CHT
resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
ulp = floatUlpErr (resf, ref);
incorrectly_rounded += ulp > 0.5;
relerr = fabs ((resf - ref) / ref);
if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
maxulp = ulp;
maxrelerr = relerr;
printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
count, incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
}
} while (count <= N);
return EXIT_SUCCESS;
}