Storing non-negative floating point values (answers)

【Question title】: Storing non-negative floating point values
【Posted】: 2012-04-26 15:32:01
【Question】:

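Is there an efficient way to store non-negative floating point values using the existing float32 and float64 formats?

Consider the default float32 behavior, which allows both negative and positive values: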

val = bytes.readFloat32();

If negative values are not needed, is it possible to allow larger positive values?

val = bytes.readFloat32() + 0xFFFFFFFF;

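Edit: Essentially, since I know I am only storing positive values, could the floating point format be modified slightly to give more range or more precision for the same number of bits?

E.g. the float32 format is defined as 1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction.

What if I don't need the sign bit: could we have 8 bits for the exponent and 24 bits for the fraction, to get greater precision within the same 32 bits?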

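【Comments】:

Are you thinking of how an int can be made an unsigned int to gain another bit? I don't think that works here... otherwise it would already have been done.
Do you really need the extra range that badly?
Tell us more about your problem. Why do you want to do this? You may be asking the wrong question.

Tags: c++ floating-point unsigned primitive-types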

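【Answer 1】:

There is almost no support for unsigned float in hardware, so you won't get this feature off the shelf. You can still have a fairly efficient unsigned float by storing the least significant bit (lsb) in the sign bit. This lets you use the floating-point hardware that is already available instead of writing a software floating-point solution. To do that you can either:

manipulate the value manually after each operation. This way you need a small correction to the lsb (a.k.a. the sign bit), for example one more long-division step, or a 1-bit adder for addition;

or do the math in higher precision if it is available. For example, if the type is float you can do the operations in double and cast back to float when storing.

Here is a simple PoC implementation: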

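#include <bit>       // std::bit_cast (C++20)
#include <cfenv>     // std::fesetround
#include <cfloat>    // DBL_MANT_DIG, FLT_MANT_DIG
#include <cmath>
#include <cstdint>   // uint32_t, uint64_t
#include <stdexcept> // std::range_error, std::runtime_error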

// Does the math in double precision when hardware double is available
#define HAS_NATIVE_DOUBLE

class UFloat
{
public:
    UFloat(double d) : UFloat(0.0f)
    {
        if (d < 0)
            throw std::range_error("Value must be non-negative!");
        uint64_t dbits = std::bit_cast<uint64_t>(d);
        bool lsb = dbits & lsbMask;
        dbits &= ~lsbMask; // turn off the lsb
        d = std::bit_cast<double>(dbits);
        value = lsb ? -(float)d : (float)d;
    }

    UFloat(const UFloat &rhs) : UFloat(rhs.value) {}

    // =========== Operators ===========
    UFloat &operator+=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
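        // Decode both operands (the sign bit holds the lsb, so the raw
        // floats cannot be added directly), calculate in higher precision,
        // then re-encode
        setValue(getUnsignedValue() + rhs.getUnsignedValue());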
#else
        // Calculate the least significant bit manually
        
        bool lhsLsb = std::signbit(value);
        bool rhsLsb = std::signbit(rhs.value);
        // Clear the sign bit to get the higher significant bits
        // then get the sum
        value = std::abs(value);
        value += std::abs(rhs.value);
        if (std::isfinite(value))
        {
            if (lhsLsb ^ rhsLsb) // Only ONE of the 2 least significant bits is 1
            {
                // The sum's lsb is 1, so we'll set its sign bit
                value = -value;
            }
            else if (lhsLsb)
            {
                // BOTH least significant bits are 1s,
                // so we'll add the carry to the next bit
                value = std::nextafter(value, INFINITY);
                // The lsb of the sum is 0, so the sign bit isn't changed
            }
        }
#endif
        return *this;
    }

    UFloat &operator*=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
        // Decode both operands, calculate in higher precision, then re-encode
        setValue(getUnsignedValue() * rhs.getUnsignedValue());
#else
        // Calculate the least significant bit manually
    
        bool lhsLsb = std::signbit(value);
        bool rhsLsb = std::signbit(rhs.value);

        // Clear the sign bit to get the higher significant bits
        // then get the product
        float lhsMsbs = std::abs(value);
        float rhsMsbs = std::abs(rhs.value);

        // Suppose we have X.xPm with
        //     X: the high significant bits
        //     x: the least significant one
        // and m: the exponent. Same to Y.yPn
        // X.xPm * Y.yPn = (X + 0.x)*2^m * (Y + 0.y)*2^n
        //               = (X + x/2)*2^m * (Y + y/2)*2^n
        //               = (X*Y + X*y/2 + Y*x/2 + x*y/4)*2^(m + n)
        value = lhsMsbs * rhsMsbs; // X*Y
        if (std::isfinite(value))
        {
            uint32_t rhsMsbsBits = std::bit_cast<uint32_t>(rhsMsbs);
            value += rhsMsbs*lhsLsb / 2; // Y*x/2

            uint32_t lhsMsbsBits = std::bit_cast<uint32_t>(lhsMsbs);
            value += lhsMsbs*rhsLsb / 2; // X*y/2
            
            int lsb = (rhsMsbsBits | lhsMsbsBits) & 1; // the product's lsb
            lsb += lhsLsb & rhsLsb;
            if (lsb & 1)
                value = -value; // set the lsb
            if (lsb > 1)    // carry to the next bit
                value = std::nextafter(value, INFINITY);
        }
#endif

        return *this;
    }
    
    UFloat &operator/=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
        // Decode both operands, calculate in higher precision, then re-encode
        setValue(getUnsignedValue() / rhs.getUnsignedValue());
#else
        // Calculate the least significant bit manually
        // Do just one more step of long division,
        // since we only have 1 bit left to divide

        throw std::runtime_error("Not Implemented yet!");
#endif

        return *this;
    }

    double getUnsignedValue() const
    {
        if (!std::signbit(value))
        {
            return value;
        }
        else
        {
            double result = std::abs(value);
            uint64_t doubleValue = std::bit_cast<uint64_t>(result);
            doubleValue |= lsbMask; // turn on the least significant bit
            result = std::bit_cast<double>(doubleValue);
            return result;
        }
    }
    
private:
    // The unsigned float value, with the least significant bit (lsb)
    // being stored in the sign bit
    float value;
    
    // the first bit after the normal mantissa bits
    static const uint64_t lsbMask = 1ULL << (DBL_MANT_DIG - FLT_MANT_DIG - 1);

    // =========== Private Constructor ===========
    UFloat(float rhs) : value(rhs)
    {
        std::fesetround(FE_TOWARDZERO); // We'll round the value ourselves
#ifdef HAS_NATIVE_DOUBLE
        static_assert(sizeof(float) < sizeof(double));
#endif
    }

    void setValue(double d)
    {
        // get the bit pattern of the double value
        auto bits = std::bit_cast<std::uint64_t>(d);
        bool lsb = bits & lsbMask;

        // turn off the lsb to avoid rounding when converting to float
        bits &= ~lsbMask;
        d = std::bit_cast<double>(bits);

        value = (float)d;
        if (lsb)
            value = -value;
    }
};

Some more tuning may be needed to get the lsb exactly right.

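Either way you need more operations than usual, so this is probably only worthwhile for large arrays where cache footprint is a concern. In that case I suggest using it as a storage format only, the way FP16 is treated on most current architectures: there are only load/store instructions for it, which expand the value to float or double and convert it back. All arithmetic is done in float or double only.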

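So the unsigned float should exist only in memory and be decoded to a full double on load. This way you work with the native double type and do not need a correction after every operator.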

Alternatively, this can be combined with SIMD to operate on multiple unsigned floats at the same time.

【Comments】:

【Answer 2】:

No, not for free.

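You can extend range or accuracy in many ways by using other numerical representations. The intent will not be clear, however, and performance will typically be poor if you want the range and accuracy of float or double from another numerical representation of equal size.

Just stick with float or double unless performance or storage is very, very important and you can represent your values well (or better!) with another numerical representation.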
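The answer does not name a specific alternative representation. As one illustration only, here is a minimal sketch of unsigned fixed point with 16 integer bits and 16 fraction bits in a 32-bit word. It assumes every value is known to lie in [0, 65536) and that a uniform resolution of 2^-16 is acceptable; `toFixed`, `fromFixed` and `kScale` are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scale factor: 2^16 steps per unit (unsigned Q16.16).
constexpr double kScale = 65536.0;

// Assumes 0 <= v < 65536; rounds to the nearest representable step.
// Values outside that range are not handled by this sketch.
std::uint32_t toFixed(double v) {
    return static_cast<std::uint32_t>(std::llround(v * kScale));
}

// Converts the stored word back to a double for arithmetic.
double fromFixed(std::uint32_t q) {
    return q / kScale;
}
```

Whether this beats float depends on the data: fixed point spends all 32 bits on a known range with constant absolute precision, while float keeps roughly constant relative precision over a much wider range.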

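【Comments】:

【Answer 3】:

Floating point numbers (float32 and float64) have an explicit sign bit. There is no floating-point equivalent of unsigned integers.

So there is no easy way to double the range of positive floating point numbers.

【Comments】: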