确定 C/C++ 中的浮点运算是否发生舍入答案

【问题标题】：Determine if rounding occurred for a floating-point operation in C/C++确定 C/C++ 中的浮点运算是否发生舍入
【发布时间】：2019-06-07 17:20:34
【问题描述】：

我正在尝试提出一种有效的方法来确定 IEEE-754 操作何时会/确实会发生舍入。不幸的是，我无法简单地检查硬件标志。它必须在几个不同的平台上运行。

我想到的一种方法是在不同的舍入模式下进行运算来比较结果。

加法示例：

    double result = operand1 + operand2;
    // save rounding mode
    int savedMode = fegetround();
    fesetround(FE_UPWARD);
    double upResult = operand1 + operand2;
    fesetround(FE_DOWNWARD);
    double downResult = operand1 + operand2;
    // restore rounding mode
    fesetround(savedMode);
    return (result != upResult) || (result != downResult);

但这显然是低效的，因为它必须执行 3 次操作。

【问题讨论】：

当你说你不能检查硬件标志时，这是否也意味着你不能使用标准的 C fegetexceptflag？如果是这样，如果有一个好的解决方案，我会感到惊讶。
大多数情况下，当标准 C 例程没有用处（由于缺乏编译器支持）时，解决方案是编写内联汇编并针对每个平台进行自定义。
@EricPostpischil 通过检查硬件标志，我的意思是显式执行操作并在内联汇编中读取标志。使用fegetexceptflag 可能没问题，但如果我要使用它，是否保证检索到的标志与刚刚执行的浮点运算相对应？
浮点标志是累积的，这意味着操作设置它们，并且它们保持设置直到重置。因此，如果您调用fegetexceptflag 并发现已引发不精确标志，则它可能来自最近的操作或您之前检查或清除标志后的任何操作。（当然，您的 C 实现必须支持 fegetexceptflag；在 C 中支持访问浮点环境是可选的。）
读取标志可能会有些“昂贵”，可能是因为它需要处理器内的指令同步，所以通常不会在每次操作后都进行。人们通常会执行一系列操作，然后检查标志以查看其期间是否发生异常，如果发生异常则转移到备用处理，并期望这种转移很少见。对于不精确标志，在典型的浮点工作中设置它并不少见；只有在专门设计的代码中才会很少见。

标签： floating-point rounding ieee-754

【解决方案1】：

您的示例不一定会通过优化给出正确的结果级别-O1 或更高。看到这个Godbolt link：编译器只生成一个加法vaddsd。

经过优化级别-O0 程序集看起来不错，但这会导致代码效率低下。此外调用fegetround 和fesetround 相对昂贵，与一些浮点运算的成本相比。

下面的（自我解释的）代码可能是一个有趣的替代方案。它使用众所周知的算法 2Sum 和 2ProdFMA。在没有硬件 fma 或 fma 仿真的系统上，您可以使用 2Prod 算法而不是 2ProdFMA，参见，例如，准确的浮点乘积和求幂，作者：Stef Graillat。

/*
gcc -m64 -Wall -O3 -march=haswell round_ex.c -lm
   or with fma emulation on systems without hardware fma support, for example:
gcc -m64 -Wall -O3  -march=nehalem  round_ex.c -lm
*/

#include<math.h>
#include<float.h>
#include<stdio.h>

int add_is_not_exact(double operand1, double operand2){
    double a = operand1;
    double b = operand2;
    double s, t, a_1, b_1, d_a, d_b;
    /* Algorithm 2Sum computes s and t such that a + b = s + t, exactly.         */
    /* Here t is the error of the floating-point addition s = a + b.             */
    /* See, for example, On the robustness of the 2Sum and Fast2Sum algorithms   */
    /* by Boldo, Graillat, and Muller                                            */
    s = a + b;
    a_1 = s - b;
    b_1 = s - a_1;
    d_a = a - a_1;
    d_b = b - b_1;
    t = d_a + d_b;
    return (t!=0.0);
}


int sub_is_not_exact(double operand1, double operand2){
    return add_is_not_exact(operand1, -operand2);
}


int mul_is_not_exact(double operand1, double operand2){
    double a = operand1;
    double b = operand2;
    double s, t;
    /* Algorithm 2ProdFMA computes s and t such that a * b = s + t, exactly.     */
    /* Here t is the error of the floating-point multiplication s = a * b.       */
    /* See, for example, Accurate Floating Point Product and Exponentiation      */
    /* by Graillat                                                               */
    s = a * b;
    t = fma(a, b, -s);
    if (s!=0) return (t!=0.0);       /* No underflow of a*b                                */
    else return (a!=0.0)&&(b!=0.0);  /* Underflow: inexact if s=0, but (a!=0.0)&&(b!=0.0)  */
}


int div_is_not_exact(double operand1, double operand2){
    double a = operand1;
    double b = operand2;
    double s, t;
    s = a / b;
    t = fma(s, b, -a);  /* fma(x,y,z) computes x*y+z with infinite intermediate precision */
    return (t!=0.0);
}


int main(){

    printf("add_is_not_exact(10.0, 1.0) = %i\n", add_is_not_exact(10.0, 1.0));
    printf("sub_is_not_exact(10.0, 1.0) = %i\n", sub_is_not_exact(10.0, 1.0));
    printf("mul_is_not_exact( 2.5, 2.5) = %i\n", mul_is_not_exact( 2.5, 2.5));
    printf("div_is_not_exact(  10, 2.5) = %i\n", div_is_not_exact(  10, 2.5));
    printf("add_is_not_exact(10.0, 0.1) = %i\n", add_is_not_exact(10.0, 0.1));
    printf("sub_is_not_exact(10.0, 0.1) = %i\n", sub_is_not_exact(10.0, 0.1));
    printf("mul_is_not_exact( 2.6, 2.6) = %i\n", mul_is_not_exact( 2.6, 2.6));
    printf("div_is_not_exact(  10, 2.6) = %i\n", div_is_not_exact(  10, 2.6));

    printf("\n0x1.0p-300 = %20e, 0x1.0p-600 = %20e \n", 0x1.0p-300 , 0x1.0p-600 );
    printf("mul_is_not_exact( 0x1.0p-300, 0x1.0p-300) = %i\n", mul_is_not_exact( 0x1.0p-300, 0x1.0p-300));
    printf("mul_is_not_exact( 0x1.0p-600, 0x1.0p-600) = %i\n", mul_is_not_exact( 0x1.0p-600, 0x1.0p-600));

}

输出是：

$ ./a.out
add_is_not_exact(10.0, 1.0) = 0
sub_is_not_exact(10.0, 1.0) = 0
mul_is_not_exact( 2.5, 2.5) = 0
div_is_not_exact(  10, 2.5) = 0
add_is_not_exact(10.0, 0.1) = 1
sub_is_not_exact(10.0, 0.1) = 1
mul_is_not_exact( 2.6, 2.6) = 1
div_is_not_exact(  10, 2.6) = 1

0x1.0p-300 =         4.909093e-91, 0x1.0p-600 =        2.409920e-181 
mul_is_not_exact( 0x1.0p-300, 0x1.0p-300) = 0
mul_is_not_exact( 0x1.0p-600, 0x1.0p-600) = 1

如 cmets 中所述，也可以直接读取控制和状态寄存器：

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

int add_is_not_exact_v2(double a, double b)
{    
    fexcept_t excepts;
    feclearexcept(FE_ALL_EXCEPT);
    double c = a+b;
    int tst = fetestexcept(FE_INEXACT);
    return (tst!=0);
}

但是请注意，这可能不适用于编译器优化级别 -O1 或更高级别。在这种情况下，addsd 双加指令有时会被完全优化掉，导致错误的结果。例如，使用 gcc 8.2 gcc -m64 -O1 -march=nehalem:

add_is_not_exact_v2:
        sub     rsp, 8
        mov     edi, 61
        call    feclearexcept
        mov     edi, 32
        call    fetestexcept
        test    eax, eax
        setne   al
        movzx   eax, al
        add     rsp, 8
        ret

优化级别-O0，有2个函数调用，相对修改控制和状态寄存器的扩展指令，这不一定是最有效的解决方案。

【讨论】：

如果我需要知道 fma 操作是否不精确，是否会遵循相同的逻辑？ s = fma(a, b, c); t = fma(a, b, (c-s)); 表示中间c-s 结果时会丢失准确性吗？
不，对于fma 的准确性，这是行不通的。例如fma(1.0, 1.0, 1e-20); 给出的结果不准确，但t=0。另一方面，对于a=1.0+2^(-30) 和b=2^(-60)，fma(a,a,-b) 是准确的，但t!=0。可能有可能以某种方式用2prod 和多个2sums 测试（不）精确性：应该可以计算出精确的a*b+c=s+t+u，并测试t=u=0。
对于无错误的[x, y, z] = ThreeFMA(a, b, c) (a*b+c=x+y+z)，请参阅 Graillat 和 Ménissier-Morain 撰写的 实数和复数浮点算术中的无错误转换 ，并参见 Boldo 和 Muller 的 Some Functions Computable with a Fused-mac。在这种情况下，fma 是不精确的，如果 y+z!=0.0，如果没有发生下溢。如果发生下溢（这种情况很少见），我还没有考虑过解决方案。