为什么hypot()函数这么慢？答案

【问题标题】：Why hypot() function is so slow?为什么hypot()函数这么慢？
【发布时间】：2011-04-15 11:09:56
【问题描述】：

我用C++hypot()和JavaMath.hypot做了一些测试。它们似乎都比sqrt(a*a + b*b) 慢得多。是因为精度更高吗？计算斜边hypot 函数使用什么方法？令人惊讶的是，我在文档中找不到任何表现不佳的迹象。

【问题讨论】：

什么是“明显变慢”？你能量化这个值吗？你用过分析器吗？你运行了多少次测试？您能描述一下您的实验 (DOE) 吗？
在 Java 中它慢了约 7 倍，在 C++ 中慢了约 10 倍。在为即将到来的大学编程竞赛测试一个编程问题时，我们与我的同事独立发现了这一点。
@linuxuser27：以及对他的评论点赞的两个人，请查看 Ravi Gummadi +9 点赞的答案以获得启发。
当然。这是一个很好的答案，我为 9 做出了贡献。我的评论并没有粗鲁。我只是厌倦了关于 SO 的问题，这些问题在没有进行调查的情况下对速度和性能做出了极端的陈述。列昂尼德实际上能够给出一些因素，这意味着已经进行了尽职调查。我希望每个人在做出诸如“X 与 Y 相比太慢”之类的陈述之前都先完成这类工作，因为一半时间所做的只是观察花费了多长时间，而不是实际测量某些东西。
hypot 的糟糕表现在这个基准测试中也很明显：blog.juma.me.uk/2011/02/23/…

标签： java c++ hypotenuse

【解决方案1】：

这不是一个简单的 sqrt 函数。您应该检查此链接以了解算法的实现：http://www.koders.com/c/fid7D3C8841ADC384A5F8DE0D081C88331E3909BF3A.aspx

它有while循环来检查收敛

/* Slower but safer algorithm due to Moler and Morrison.  Never
         produces any intermediate result greater than roughly the
         larger of X and Y.  Should converge to machine-precision
         accuracy in 3 iterations.  */

      double r = ratio*ratio, t, s, p = abig, q = asmall;

      do {
        t = 4. + r;
        if (t == 4.)
          break;
        s = r / t;
        p += 2. * s * p;
        q *= s;
        r = (q / p) * (q / p);
      } while (1);

编辑（来自 J.M 的更新）：

Here 是最初的 Moler-Morrison 论文，here 是 Dubrulle 的一个很好的后续。

【讨论】：

如果这个函数比 sqrt 方法更好，我会很高兴举个例子。
@schnaader - 阅读链接页面，原因就在顶部（短版，天真的方法可能会在不应该的时候溢出）
这对于非常大的 x 和 y 值很重要。例如，如果 x 和 y 使得 x^2 + y^2 大于 MAX_DOUBLE，则 sqrt(x^2 + y^2) 版本将失败。但是，由于此方法不会产生比 x 或 y 大得多的值，因此在这些情况下应该是安全的。
遗憾的是，链接站点 koders.com 现在返回错误 403。
@alvherre：我更新了链接以指向 archive.org 上的最新快照，日期为 2010 年 7 月 10 日

【解决方案2】：

hypot() 与幼稚的实现sqrt(a*a+b*b) 相比，在避免中间计算中的上溢和下溢方面会产生开销。这涉及缩放操作。在较旧的实现中，缩放可能使用除法，这本身就是一个相当慢的操作。即使缩放是通过直接指数操作完成的，在旧的处理器架构上它也可能相当慢，因为在整数 ALU 和浮点单元之间传输数据相当慢（例如，涉及到内存的往返）。数据驱动的分支对于各种扩展方法也很常见；这些很难预测。

数学库设计者的一个常见目标是实现对 hypot() 等简单数学函数的忠实舍入，即最大误差小于 1 ulp。与朴素解决方案相比，这种准确性的提高意味着必须以高于原生精度的方式执行中间计算。一种经典的方法是将操作数分成“高”和“低”部分并模拟扩展精度。这增加了拆分的开销并增加了浮点运算的数量。最后，hypot() 的 ISO C99 规范（后来被 C++ 标准采用）规定了对 NaN 和无穷大的处理，这些计算不会自然而然地脱离直接计算。

较早的最佳做法的代表性示例是 __ieee754_hypot() in FDLIBM。虽然它声称最大误差为 1 ulp，但针对任意精度库的快速测试表明它实际上并未实现该目标，因为可以观察到高达 1.15 ulp 的误差。

自从提出这个问题以来，处理器架构的两项进步使hypot() 实现更加高效。一个是引入了融合乘加 (FMA) 操作，它在内部使用完整的产品进行加法，并在最后只应用一次舍入。这通常减轻了在中间计算中模拟稍高精度的需求，并且通过将两个操作组合到一条指令中也可以提高性能。另一个架构上的进步是允许浮点和整数运算在同一组寄存器上进行操作，从而降低了浮点操作数的位操作成本。

因此，在现代高性能数学库中，hypot() 的吞吐量通常约为sqrt() 的一半，例如，英特尔为 MKL 发布的performance data 中可以看到。

对于自己的实验，最好从 hypot() 的单精度实现开始，因为这允许在所有可能的参数值的总组合中的较大部分进行测试。这也允许在使用许多现代处理器架构提供的低精度倒数平方根近似时探索速度和准确性之间的权衡。一种这样的探索的示例性代码如下所示。

请注意，ISO C99（以及扩展的 C++）对 fmin() fmax() 施加的 NaN 处理要求并不总是与浮点最小/最大机器指令的功能相匹配，从而使这些功能变慢比他们可能。由于下面的代码分别处理 NaN，因此应该可以直接使用此类机器指令。代码编译时应进行全面优化，但也要严格遵守 IEEE-754。

基于对 500 亿个随机参数对的测试，以下代码变体观察到的最大错误为：(USE_BEEBE == 1): 1.51 ulp; （USE_BEEBE == 0 && USE_SQRT == 1）：1.02 ulp；（USE_BEEBE == 0 && USE_SQRT == 0 && USE_2ND_ITERATION == 0）：2.77 ulp； (USE_BEEBE == 0 && USE_SQRT == 0 && USE_2ND_ITERATION == 1)：1.02 ulp。

#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>
#include "immintrin.h"

uint32_t __float_as_uint32 (float a);
float __uint32_as_float (uint32_t a);
float sse_rsqrtf (float a);

float my_hypotf (float a, float b)
{
    float fa, fb, mn, mx, res, r, t;

    fa = fabsf (a);
    fb = fabsf (b);
    mx = fmaxf (fa, fb);
    mn = fminf (fa, fb);

#if USE_BEEBE
    /* Nelson H. F. Beebe, "The Mathematical Function Computation Handbook."
       Springer 2017
    */
    float s, c;

    r = mn / mx;
    t = fmaf (r, r, 1.0f);
    s = sqrtf (t);
    c = fmaf (-s, s, t) / (s + s);
    res = fmaf (mx, s, mx * c);
    if (mx == 0) res = mx; // fixup hypot(0,0)

#else // USE_BEEBE

    float scale_in, scale_out, s, v, w;
    uint32_t expo;

    /* compute scale factors */
    expo = __float_as_uint32 (mx) & 0xfc000000;
    scale_in = __uint32_as_float (0x7e000000 - expo);
    scale_out = __uint32_as_float (expo + 0x01000000);

    /* scale mx towards unity */
    mn = mn * scale_in;
    mx = mx * scale_in;

    /* mx in [2**-23, 2**6) */
    s = fmaf (mx, mx, mn * mn); // 0.75 ulp
#if USE_SQRT
    w = sqrtf (s);
#else // USE_SQRT
    r = sse_rsqrtf (s);
    /* use A. Schoenhage's coupled iteration for the square root */
    v = r * 0.5f;
    w = r * s;
    w = fmaf (fmaf (w, -w, s), v, w);
#if USE_2ND_ITERATION
    v = fmaf (fmaf (r, -w, 1), v, v);
    w = fmaf (fmaf (w, -w, s), v, w);
#endif // USE_2ND_ITERATION
    if (mx == 0) w = mx; // fixup hypot(0,0)
#endif // USE_SQRT

    /* reverse previous scaling */
    res = w * scale_out;
#endif // USE_BEEBE

    /* check for special cases: infinity and NaN */
    t = a + b;
    if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
    if (mx == INFINITY) res = mx; // isinf(mx)
    return res;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float sse_rsqrtf (float a)
{
    __m128 b, t;
    float res;
    int old_mxcsr;
    old_mxcsr = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE (_MM_FLUSH_ZERO_OFF); 
    _MM_SET_ROUNDING_MODE (_MM_ROUND_NEAREST);
    b = _mm_set_ss (a);
    t = _mm_rsqrt_ss (b);
    _mm_store_ss (&res, t);
    _mm_setcsr (old_mxcsr);
    return res;

}

hypot() 的基于 FMA 的简单双精度实现如下所示：

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double __uint64_as_double (uint64_t a)
{
    double r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double my_hypot (double a, double b)
{
    double fa, fb, mn, mx, scale_in, scale_out, r, s, t;
    uint64_t expo;

    fa = fabs (a);
    fb = fabs (b);
    mx = fmax (fa, fb);
    mn = fmin (fa, fb);

    /* compute scale factors */
    expo = __double_as_uint64 (mx) & 0xff80000000000000ULL;
    scale_in = __uint64_as_double (0x7fc0000000000000ULL - expo);
    scale_out = __uint64_as_double (expo + 0x0020000000000000ULL);

    /* scale mx towards unity */
    mn = mn * scale_in;
    mx = mx * scale_in;

    /* mx in [2**-52, 2**6) */
    s = fma (mx, mx, mn * mn); // 0.75 ulp
    r = sqrt (s);

    /* reverse previous scaling */
    r = r * scale_out;

    /* check for special cases: infinity and NaN */
    t = a + b;
    if (!(fabs (t) <= INFINITY)) r = t; // isnan(t)
    if (mx == INFINITY) r = mx; // isinf(mx)

    return r;
}

【讨论】：

【解决方案3】：

我发现 Math.hypot() 非常慢。我发现我可以使用产生相同结果的相同算法编写一个快速的 Java 版本。这可以用来替换它。

/**
 * <b>hypot</b>
 * @param x
 * @param y
 * @return sqrt(x*x +y*y) without intermediate overflow or underflow. 
 * @Note {@link Math#hypot} is unnecessarily slow.  This returns the identical result to 
 * Math.hypot with reasonable run times (~40 nsec vs. 800 nsec). 
 * <p>The logic for computing z is copied from "Freely Distributable Math Library" 
 * fdlibm's e_hypot.c. This minimizes rounding error to provide 1 ulb accuracy.
 */
public static double hypot(double x, double y) {

    if (Double.isInfinite(x) || Double.isInfinite(y)) return Double.POSITIVE_INFINITY;
    if (Double.isNaN(x) || Double.isNaN(y)) return Double.NaN;

    x = Math.abs(x);
    y = Math.abs(y);

    if (x < y) {
        double d = x;
        x = y;
        y = d;
    }

    int xi = Math.getExponent(x);
    int yi = Math.getExponent(y);

    if (xi > yi + 27) return x;

    int bias = 0;
    if (xi > 510 || xi < -511) {
        bias = xi;
        x = Math.scalb(x, -bias);
        y = Math.scalb(y, -bias);           
    }


    // translated from "Freely Distributable Math Library" e_hypot.c to minimize rounding errors
    double z = 0; 
    if (x > 2*y) { 
        double x1 = Double.longBitsToDouble(Double.doubleToLongBits(x) & 0xffffffff00000000L);
        double x2 = x - x1;
        z = Math.sqrt(x1*x1 + (y*y + x2*(x+x1)));
    } else {
        double t = 2 * x;
        double t1 = Double.longBitsToDouble(Double.doubleToLongBits(t) & 0xffffffff00000000L);
        double t2 = t - t1;
        double y1 = Double.longBitsToDouble(Double.doubleToLongBits(y) & 0xffffffff00000000L);
        double y2 = y - y1;
        double x_y = x - y;
        z = Math.sqrt(t1*y1 + (x_y*x_y + (t1*y2 + t2*y))); // Note: 2*x*y <= x*x + y*y
    }

    if (bias == 0) {
        return z; 
    } else {
        return Math.scalb(z, bias);
    }
}

【讨论】：

【解决方案4】：

http://steve.hollasch.net/cgindex/math/pythag-root.txt

建议对于 Moler & Morrison，使用 sqrt() 的更快实现是二次方与三次方，具有大致相同的溢出特性。

【讨论】：

【解决方案5】：

这是一个更快的实现，其结果也更接近于 java.lang.Math.hypot：（注意：对于 Delorie 的实现，需要添加对 NaN 和 +-Infinity 输入的处理）

private static final double TWO_POW_450 = Double.longBitsToDouble(0x5C10000000000000L);
private static final double TWO_POW_N450 = Double.longBitsToDouble(0x23D0000000000000L);
private static final double TWO_POW_750 = Double.longBitsToDouble(0x6ED0000000000000L);
private static final double TWO_POW_N750 = Double.longBitsToDouble(0x1110000000000000L);
public static double hypot(double x, double y) {
    x = Math.abs(x);
    y = Math.abs(y);
    if (y < x) {
        double a = x;
        x = y;
        y = a;
    } else if (!(y >= x)) { // Testing if we have some NaN.
        if ((x == Double.POSITIVE_INFINITY) || (y == Double.POSITIVE_INFINITY)) {
            return Double.POSITIVE_INFINITY;
        } else {
            return Double.NaN;
        }
    }
    if (y-x == y) { // x too small to substract from y
        return y;
    } else {
        double factor;
        if (x > TWO_POW_450) { // 2^450 < x < y
            x *= TWO_POW_N750;
            y *= TWO_POW_N750;
            factor = TWO_POW_750;
        } else if (y < TWO_POW_N450) { // x < y < 2^-450
            x *= TWO_POW_750;
            y *= TWO_POW_750;
            factor = TWO_POW_N750;
        } else {
            factor = 1.0;
        }
        return factor * Math.sqrt(x*x+y*y);
    }
}

【讨论】：