Accuracy of integer sqrt using double

【Question title】: Accuracy of integer sqrt using double
【Posted】: 2017-12-04 17:39:10
【Question description】:

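I want to compute the integer part of the square root of a uint64_t. For a 32-bit uint32_t, the usual advice is to convert to double, call sqrt, and convert back to uint32_t.

Does this also work for uint64_t, given that a double can only hold integers exactly up to 2^53? That is, does the following always give the correct answer: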

#include <math.h>
#include <stdint.h>
uint64_t x = ...;
uint64_t result = (uint64_t)sqrt((double)x);

Or even:

#include <math.h>
#include <stdint.h>
uint64_t x = ...;
uint32_t result = (uint32_t)sqrt((double)x);

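【Question comments】:

Even for uint32_t, this is only reliable if you know that your math library's sqrt is good and that your C implementation handles floating-point arithmetic well. The C standard alone does not require this. Some math libraries return only approximate results even for values whose exact square root is representable.
I wrote an answer recommending the double strategy for 32-bit integers, but it was in response to a Java question. That answer depends on Java-specific guarantees and does not apply to C.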

Tags: floating-point square-root

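【Solution 1】:

Empirically, the answer is no. For the input 4503599761588224, the result is incorrectly computed as 67108865 instead of 67108864.

The following code finds this case.¹ Of course, you can remove the break; to observe other cases.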

#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // y wraps around to 0 after UINT32_MAX, which ends the loop
    for (uint32_t y = 1; y != 0; y++) {
        // *Just* smaller than a perfect square
        uint64_t x = ((uint64_t)y * (uint64_t)y) - 1;

        // We expect the floor of the exact square root
        uint32_t expected = y - 1;

        uint32_t result = (uint32_t)sqrt((double)x);

        if (result != expected) {
            // PRIu64/PRIu32 are the portable format macros for fixed-width types
            printf("Incorrect: x = %" PRIu64 ", result = %" PRIu32 "\n", x, result);
            break;
        }
    }
    return 0;
}

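What is special about the value 4503599761588224? Well, it is exactly (2²⁶ + 1)² - 1, a.k.a. 2⁵² + 2²⁷. This is exactly representable as a double, so the error is not caused by the uint64_t -> double conversion.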
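One way to check the representability claim is to round-trip the value through double. The sketch below is only an illustration: the helper name round_trips_through_double is hypothetical (it is not from the thread), and it assumes double is IEEE 754 binary64 and that the cast really rounds to double precision (no x87-style excess precision).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if x survives uint64_t -> double -> uint64_t
   unchanged, i.e. x is exactly representable as a double. */
static bool round_trips_through_double(uint64_t x) {
    double d = (double)x;
    /* Values near UINT64_MAX round up to 2^64, which does not fit in
       uint64_t; converting that back would be undefined behavior. */
    if (d >= 18446744073709551616.0) {
        return false;
    }
    return (uint64_t)d == x;
}
```

Under those assumptions the helper returns true for 4503599761588224 and false for 2^53 + 1 = 9007199254740993, the smallest positive integer that a double cannot hold exactly.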

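Instead, the error arises inside sqrt itself. The delta here (relative to a perfect square) lowers the square root by about 2⁻²⁷, which is about 2⁵³ times smaller than the result itself. That is the limit of what double precision can resolve, so the exact root rounds to 67108865.0, and this is naturally where we would expect errors to start appearing.²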
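The rounding argument can be spelled out as a short derivation. This is a sketch that assumes IEEE 754 binary64, round-to-nearest, and a correctly rounded sqrt; RN denotes rounding to the nearest double and ulp the spacing between adjacent doubles (both symbols are introduced here, not taken from the thread).

```latex
\begin{align*}
y &= 2^{26} + 1, \qquad x = y^2 - 1 = 2^{52} + 2^{27},\\
y - \sqrt{x} &= \frac{y^2 - x}{y + \sqrt{x}} = \frac{1}{y + \sqrt{x}} < \frac{1}{2^{27}} = 2^{-27},\\
\operatorname{ulp}(v) &= 2^{26-52} = 2^{-26} \quad \text{for } 2^{26} \le v < 2^{27},\\
y - 2^{-27} &< \sqrt{x} < y \;\Longrightarrow\; \operatorname{RN}(\sqrt{x}) = y = 67108865 .
\end{align*}
```

The exact root lies above the midpoint between the two neighbouring doubles y - 2⁻²⁶ and y, so it rounds to y, and truncating to an integer then yields y rather than the expected y - 1.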

1. Live demo.

2. Thanks to @EricPostpischil for identifying the root cause in the comments below :)

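【Comments】:

Of course, if the given math library is good and returns a correctly rounded square root, then we know the result is close and can easily test and correct it with integer arithmetic.
The reason for the failure at 2**26 is that the derivative of sqrt(x) is 1/(2*sqrt(x)). So at (2**26)**2, subtracting one reduces the square root by about 2**-27. And the square root is just under 2**26, so the reduction is about 2**-53 times the square root, which means you have just hit the edge of double precision.
Even if the math library is only very roughly correct, you can bracket the square root and do a binary search.
@EricPostpischil - Great explanation. I have incorporated this into my answer (thanks!).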