How many values can be represented in a given range by a float? Answers

【Question title】: How many values can be represented in a given range by a float?
【Posted】: 2019-05-20 21:49:49
【Question】:

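Intuition tells me that since 32 bits can represent a fixed number of distinct values, a float can represent only a fixed number of values in any given range. Is that true? Is any of that capacity lost because of how conversions are handled?

Suppose I pick a number in the range [10³⁰, 10³⁵]. Obviously the precision I can get in this range is limited, but is there any difference in the number of values that can be represented in this range compared with a more reasonable range such as [0.0, 1000.0]?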

【Question comments】:

Tags: floating-point precision

【Answer 1】:

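This answer assumes that float maps to the binary32 type specified by the IEEE-754 (2008) standard. For normalized binary32 operands, that is, those in [2^-126, 2¹²⁸), there are always exactly 2²³ encodings per binade, because the number of stored significand bits is 23. Determining the number of binary32 encodings in the general case is a bit tricky, for example because of rounding effects: not all powers of 10 are exactly representable. Where the start and end points fall within a binade also makes a difference, and we need to account for the subnormals in [0, 2^-126].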

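To first order, though, we can estimate that the number of encodings in [10³⁰, 10³⁵] is roughly the same as in [10⁻², 10³], and therefore that the interval [0, 10³] contains many more encodings than the interval [10³⁰, 10³⁵].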

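The lazy way to establish an exact count is to count the encodings in a given interval by brute force. The C and C++ standard math libraries provide a function nextafterf that increments or decrements a given binary32 operand to its nearest neighbor in the indicated direction. So we can simply count how many times we are able to do that within the specified interval. An ISO-C99 program using this approach is shown below. It needs only a few seconds on modern hardware to give us the desired answer: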

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* count the binary32 numbers in the closed interval [start, stop] */
void countem (float start, float stop)
{
    float x;
    uint32_t count;
    count = 0;
    x = start;
    while (x <= stop) {
        count++;
        x = nextafterf (x, INFINITY);
    }
    printf ("there are %u binary32 numbers in [%15.8e, %15.8e]\n", count, start, stop);
}

int main (void)
{
    countem (0.0f, 1000.0f);
    countem (1e-2f, 1e3f);
    countem (1e30f, 1e35f);
    return EXIT_SUCCESS;
}

This program determines:

there are 1148846081 binary32 numbers in [0.00000000e+000, 1.00000000e+003]
there are 139864311 binary32 numbers in [9.99999978e-003, 1.00000000e+003]
there are 139468867 binary32 numbers in [1.00000002e+030, 1.00000004e+035]

【Comments】:

【Answer 2】:

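How many values can be represented in a given range by a float?

... since 32 bits can represent a fixed number of distinct values, a float can represent only a fixed number of values in any given range. Is that true?

Yes, that is true. Over the entire range of a typical float, about 2³² different values can be represented.

Is any of that capacity lost because of how conversions are handled?

Non sequitur. float does not define how other numeric representations are converted to and from a float. printf(), scanf(), atof(), strtof(), (float) some_integer, (some_integer_type) some_float and the compiler itself all perform conversions. C is vague about how well those conversions must be done. Quality libraries and compilers are expected to do the best possible job. For source code or "string" numbers such as "1.2345", infinitely many possible values map to about 2³² different ones. So yes, loss occurs.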

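... in the range [1030, 1035]. ... is there any difference in the number of values that can be represented in this range compared with a more reasonable range such as [0.0, 1000.0]?

Yes. float values are distributed logarithmically, not linearly. Between [1030, 1035] there are about as many different float values as between [1.030, 1.035] or [1.030e-3, 1.035e-3]. About 25% of all float values lie in the range [0.0 ... 1.0], so [0.0, 1000.0] contains many times more values than [1030, 1035].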

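【Comments】:

【Answer 3】:

This is offered for information. It could be used to produce something easier to use, such as code that provides counts, samples for various ranges, or discussion, but I am out of time and want to preserve the information so far.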

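For IEEE-754 basic 32-bit binary floating-point, the number N(x) of non-negative representable values less than or equal to a non-negative x is:

2²³•255 if 2¹²⁸ ≤ x.
2²³•(floor(log₂ x)+127) + floor(x/2^(floor(log₂ x)−23)) − 2²³ + 1 if 2⁻¹²⁶ ≤ x < 2¹²⁸.
floor(x/2^(−126−23)) + 1 if x < 2⁻¹²⁶.

So the number of representable values x with a < x ≤ b is N(b) − N(a).

Explanation:

In the first case, 2²³•255 is the number of representable non-negative finite values: 2²³ for each of the 255 exponent encodings 0 through 254, including the subnormals and zero.
In the second case, 2²³•(floor(log₂ x)+127) is 2²³ for each binade below that of x, including the subnormals and zero. To this we add the number of values in x's own binade that are less than or equal to x. That is obtained by computing x's 24-bit integer significand (rounded down) as floor(x/2^(floor(log₂ x)−23)) and then subtracting 2²³−1, which counts the significands from the first normal significand (2²³) up to and including x's significand.
In the third case, the subnormal values are spaced at intervals of 2^(−126−23), so we just count the whole intervals and include the endpoint.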

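【Comments】:

【Answer 4】:

Here is code to count the number of values representable in float in any finite range. It requires IEEE-754 arithmetic. I adapted it from my previous C++ answer.

It contains two implementations of converting a floating-point number to its encoding (one by copying the bits, one by manipulating the value mathematically). After that, the distance calculation is fairly simple: negative values have to be adjusted, and then the distance is a plain subtraction.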

#include <float.h>
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tgmath.h>


/*  Define a value with only the high bit of a uint32_t set.  This is also the
    encoding of floating-point -0.
*/
static const uint32_t HighBit = UINT32_MAX ^ UINT32_MAX>>1;


//  Return the encoding of a floating-point number by copying its bits.
static uint32_t EncodingBits(float x)
{
    uint32_t result;
    memcpy(&result, &x, sizeof result);
    return result;
}


//  Return the encoding of a floating-point number by using math.
static uint32_t EncodingMath(float x)
{
    static const int SignificandBits = FLT_MANT_DIG;
    static const int MinimumExponent = FLT_MIN_EXP;

    //  Encode the high bit.
    uint32_t result = signbit(x) ? HighBit : 0;

    //  If the value is zero, the remaining bits are zero, so we are done.
    if (x == 0) return result;

    /*  The C library provides a little-known routine to split a floating-point
        number into a significand and an exponent.  Note that this produces a
        normalized significand, not the actual significand encoding.  Notably,
        it brings significands of subnormals up to at least 1/2.  We will
        adjust for that below.  Also, this routine normalizes to [1/2, 1),
        whereas IEEE 754 is usually expressed with [1, 2), but that does not
        bother us.
    */
    int xe;
    float xf = frexp(fabs(x), &xe);

    //  Test whether the number is subnormal.
    if (xe < MinimumExponent)
    {
        /*  For a subnormal value, the exponent encoding is zero, so we only
            have to insert the significand bits.  This scales the significand
            so that its low bit is scaled to the 1 position and then inserts it
            into the encoding.
        */
        result |= (uint32_t) ldexp(xf, xe - MinimumExponent + SignificandBits);
    }
    else
    {
        /*  For a normal value, the significand is encoded without its leading
            bit.  So we subtract .5 to remove that bit and then scale the
            significand so its low bit is scaled to the 1 position.
        */
        result |= (uint32_t) ldexp(xf - .5, SignificandBits);

        /*  The exponent is encoded with a bias of (in C++'s terminology)
            MinimumExponent - 1.  So we subtract that to get the exponent
            encoding and then shift it to the position of the exponent field.
            Then we insert it into the encoding.
        */
        result |= ((uint32_t) xe - MinimumExponent + 1) << (SignificandBits-1);
    }

    return result;
}


/*  Return the encoding of a floating-point number.  For illustration, we
    get the encoding with two different methods and compare the results.
*/
static uint32_t Encoding(float x)
{
    uint32_t xb = EncodingBits(x);
    uint32_t xm = EncodingMath(x);

    if (xb != xm)
    {
        fprintf(stderr, "Internal error encoding %.99g.\n", x);
        fprintf(stderr, "\tEncodingBits says %#" PRIx32 ".\n", xb);
        fprintf(stderr, "\tEncodingMath says %#" PRIx32 ".\n", xm);
        exit(EXIT_FAILURE);
    }

    return xb;
}


/*  Return the distance from a to b as the number of values representable in
    float from one to the other.  b must be greater than or equal to a.  0 is
    counted only once.
*/
static uint32_t Distance(float a, float b)
{
    uint32_t ae = Encoding(a);
    uint32_t be = Encoding(b);

    /*  For represented values from +0 to infinity, the IEEE 754 binary
        floating-points are in ascending order and are consecutive.  So we can
        simply subtract two encodings to get the number of representable values
        between them (including one endpoint but not the other).

        Unfortunately, the negative numbers are not adjacent and run the other
        direction.  To deal with this, if the number is negative, we transform
        its encoding by subtracting from the encoding of -0.  This gives us a
        consecutive sequence of encodings from the greatest magnitude finite
        negative number to the greatest finite number, in ascending order
        except for wrapping at the maximum uint32_t value.

        Note that this also maps the encoding of -0 to 0 (the encoding of +0),
        so the two zeroes become one point, so they are counted only once.
    */
    if (HighBit & ae) ae = HighBit - ae;
    if (HighBit & be) be = HighBit - be;

    //  Return the distance between the two transformed encodings.
    return be - ae;
}


static void Try(float a, float b)
{
    printf("[%.99g, %.99g] contains %" PRIu32 " representable values.\n",
        a, b, Distance(a, b) + 1);
}


int main(void)
{
    if (sizeof(float) != sizeof(uint32_t))
    {
        fprintf(stderr, "Error, uint32_t must be the same size as float.\n");
        exit(EXIT_FAILURE);
    }

    /*  Prepare some test values:  smallest positive (subnormal) value, largest
        subnormal value, smallest normal value.
    */
    float S1 = FLT_TRUE_MIN;
    float N1 = FLT_MIN;
    float S2 = N1 - S1;

    //  Test 0 <= a <= b.
    Try( 0,  0);
    Try( 0, S1);
    Try( 0, S2);
    Try( 0, N1);
    Try( 0, 1./3);
    Try(S1, S1);
    Try(S1, S2);
    Try(S1, N1);
    Try(S1, 1./3);
    Try(S2, S2);
    Try(S2, N1);
    Try(S2, 1./3);
    Try(N1, N1);
    Try(N1, 1./3);

    //  Test a <= b <= 0.
    Try(-0., -0.);
    Try(-S1, -0.);
    Try(-S2, -0.);
    Try(-N1, -0.);
    Try(-1./3, -0.);
    Try(-S1, -S1);
    Try(-S2, -S1);
    Try(-N1, -S1);
    Try(-1./3, -S1);
    Try(-S2, -S2);
    Try(-N1, -S2);
    Try(-1./3, -S2);
    Try(-N1, -N1);
    Try(-1./3, -N1);

    //  Test a <= 0 <= b.
    Try(-0., +0.);
    Try(-0., S1);
    Try(-0., S2);
    Try(-0., N1);
    Try(-0., 1./3);
    Try(-S1, +0.);
    Try(-S1, S1);
    Try(-S1, S2);
    Try(-S1, N1);
    Try(-S1, 1./3);
    Try(-S2, +0.);
    Try(-S2, S1);
    Try(-S2, S2);
    Try(-S2, N1);
    Try(-S2, 1./3);
    Try(-N1, +0.);
    Try(-N1, S1);
    Try(-N1, S2);
    Try(-N1, N1);
    Try(-N1, 1./3);
    Try(-1./3, +0.);
    Try(-1./3, S1);
    Try(-1./3, S2);
    Try(-1./3, N1);
    Try(-1./3, 1./3);

    return 0;
}

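【Comments】:

【Answer 5】:

Maybe I am overlooking something here, but looking at the bit pattern of IEEE-754 binary32

you know that it is decoded as:

(−1)^b₃₁ × (1 + Σ_{i=1..23} b_(23−i) • 2^(−i)) × 2^(e − 127)

Then you see that the smallest exponent field e is 0 and the largest is 255. If you multiply the whole thing by 2¹²⁷, you will see that the ordering of two floating-point numbers with the same fraction is defined by the order of the exponent e, which is an integer. So if you want to sort IEEE-754 binary32 numbers from low to high, you sort

first on the sign,
second on the exponent,
third on the fraction.

This essentially means that, for non-negative values, floating-point numbers are ordered the same way as the integers formed by the same bit patterns. So if you want to know the distance between two floats, you only need to subtract the corresponding integers (this assumes that +0 and −0 are treated as equal):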

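#include <cstdint>
#include <cstring>
/* count the binary32 numbers in [start, stop); both must be finite and non-negative */
std::int32_t distance (float start, float stop)
{
    std::int32_t a, b;
    std::memcpy(&a, &start, sizeof a);  // copy the bits instead of casting pointers
    std::memcpy(&b, &stop, sizeof b);
    return b - a;
}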

Image taken from Wikipedia: https://en.wikipedia.org/wiki/Single-precision_floating-point_format

【Comments】: