【问题标题】：How python rounds the float32 number after multiplying?python如何在乘法后对float32数字进行四舍五入？
【发布时间】：2019-06-03 03:44:38
【问题描述】：

我正在硬件上设计一个单精度浮点乘法器，并使用 python 来验证我的设计。这是单精度浮点格式。

s |指数 |尾数

相乘的结果是

(S1^S2)| E1 + E2 - 127 |尾数1 * 尾数2

在执行两个尾数相乘后，我在四舍五入时遇到问题。你知道，尾数是 24 位（23 和 1 个隐藏位），乘以两个 24 位将得到 48 位。尾数字段只能包含 23 位，所以我尝试截断它，如下所示： example of mantissa truncating。但是与 python float32 乘法相比，结果似乎并不正确。
所以我在这里问python如何处理尾数乘法的事情。谢谢。

【问题讨论】：

基础 Python 没有 float32 类型。你是说 NumPy 的 np.float32 吗？
@EricPostpischil 对不起，我的意思是浮动

标签： python floating-point verilog hardware

【解决方案1】：

Python（和许多其他语言）使用 IEEE754 默认舍入模式，称为“舍入到最近的关系到偶数”。你可以在这里阅读更多信息：https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules

【讨论】：

Python does not requite any particular floating-point format or behavior; the reference manual says “You are at the mercy of the underlying machine architecture (and C or Java implementation)…”。许多实现确实使用 IEEE-754 二进制浮点，具有舍入到最近的关系到偶数。
@EricPostpischil 你说得对。我希望编程语言实际上更明确地规定了这些要求；或者至少支持与底层硬件实现相对应的显式 ieee754 类型（如果可用）。

【解决方案2】：

在这个答案中，我将解释如何使用 IEEE-754 round-to-nearest-ties-to-even 规则对基本的 32 位二进制浮点进行舍入。这可能是您想要的。尽管许多 Python 实现使用 IEEE-754，但这不是 Python 文档所要求的，而且，即使在使用 IEEE-754 的实现中，也可能与 IEEE-754 标准存在偏差，因此您不应依赖 Python 作为权威参考。

给定一个 48 位有效位¹，让我们将位编号从 0（最高有效位）到 47（最低有效位）。（这与通常的方向相反，但便于讨论。）

如果乘积在浮点格式的正常范围内，则有效数字将被四舍五入：

考虑第 24 位和之后的尾随位。如果位 24 为 0，则向下舍入（仅使用位 0 到 23）。如果位 24 为 1 且任何尾随位为 1，则向上取整（取位 0 到 23 并将它们加一）。如果第 24 位为 1 且所有尾随位均为 0，如果第 23 位为偶数（零）则向下舍入，如果第 23 位为奇数（一）则向上舍入。
四舍五入时，这可能会使位 0 到 23 溢出（如果它们是 1111…111，则从位 0 进行加法运算）。在这种情况下，将它们右移一位并将指数加一。

如上所述，以上是针对正常情况的。处理所有情况：

测试为零：如果有效数为零，则返回零（带有计算的符号）。
归一化有效位： 如果有效位字段的第 0 位为 0，则将有效位左移一位并从指数中减去一位。重复此步骤，直到有效数字的第 0 位不为 0。
溢出测试：如果真指数超过 127（有偏指数超过 254），则产生无穷大（带有计算符号）。
限制指数：如果真正的指数（我称之为e）小于-126（有偏指数小于1），则将有效位右移-126 -e 位，在左边插入 0。将指数设置为 -126。如上所述将有效位四舍五入，位编号从 0 到 47+(−126−e)。
溢出测试：如果四舍五入将指数增加到 127，则产生无穷大（带有计算符号）。
产生正常结果：如果有效数字的第 0 位为 1，则有效数字字段使用第 1 到 23 位，指数字段使用 e+127。
产生次正规结果： 否则，有效位的第 0 位为 0。在这种情况下，有效位字段仍然使用第 1 位到第 23 位，但指数字段使用零。此指数字段表示次正规数。

请注意，“限制指数”步骤似乎撤消了“归一化有效位”步骤，使前者变得不必要，但这种组合可以处理所有正常和次规范操作数的情况，从而产生正常和次规范结果。

脚注

¹ “有效位”是 IEEE-754 标准中用于浮点数小数部分的术语。 “尾数”是对数的小数部分的旧术语。（有效数字是线性的。尾数是对数的。）

【讨论】：

精彩的解释！

【解决方案3】：

我过去曾做过这件事作为求职申请的练习：

生成随机浮点数并创建 IEEE754 值和乘积的测试向量的 Python 脚本：

import bitstring, random 

spread = 10000000
n = 100000

def ieee754(flt):
    b = bitstring.BitArray(float=flt, length=32)
    return b

if __name__ == "__main__":
    with open("TestVector", "w") as f:
        for i in range(n):
            a = ieee754(random.uniform(-spread, spread))
            b = ieee754(random.uniform(-spread, spread))

            # calculate desired product based on 32-bit ieee 754 
            # representation of the floating point of each operand
            ab = ieee754(a.float * b.float)

            f.write(a.bin + "_" + b.bin + "_" + ab.bin + "\n")

使用最接近偶数舍入进行 32 位浮点乘法的 Verilog 代码：

/*   Input and outputs are in single-precision IEEE 754 FP format:
     Sign     -   1 bit, position 32
     Exponent -  8 bits, position 31-24
     Mantissa - 23 bits, position 22-1

Not implemented: special cases like inf, NaN
*/

// indices of components of IEEE 754 FP
`define SIGN    31
`define EXP     30:23
`define M       22:0

`define P       24    // number of bits for mantissa (including 
`define G       23    // guard bit index
`define R       22    // round bit index
`define S       21:0  // sticky bits range

`define BIAS    127


module fp_mul(input  wire clk,
              input  wire[31:0] a,
              input  wire[31:0] b,
              output wire[31:0] y);

    reg [`P-2:0] m;
    reg [7:0] e;
    reg s;

    reg [`P*2-1:0] product;
    reg G;
    reg R;
    reg S;
    reg normalized;

    reg state;
    reg next_state = 0;

    parameter STEP_1 = 1'b0, STEP_2 = 1'b1;


    always @(posedge clk) begin
        state <= next_state;
    end


    always @(state) begin

        case(state)
            STEP_1: begin
                // mantissa is product of a and b's mantissas, 
                // with a 1 added as the MSB to each
                product = {1'b1, a[`M]} * {1'b1, b[`M]};

                // get sticky bits by ORing together all bits right of R
                S = |product[`S]; 

                // if the MSB of the resulting product is 0
                // normalize by shifting right    
                normalized = product[47];
                if(!normalized) product = product << 1; 

                next_state = STEP_2;            
                end 

            STEP_2: begin
                // if either mantissa is 0, result is 0 
                if(!a[`M] | !b[`M]) begin 
                    s = 0; e = 0; m = 0;
                end else begin
                    // sign is xor of signs
                    s = a[`SIGN] ^ b[`SIGN];

                    // mantissa is upper 22-bits of product w/ nearest-even rounding
                    m = product[46:24] + (product[`G] & (product[`R] | S));

                    // exponent is sum of a and b's exponents, minus the bias 
                    // if the mantissa was shifted, increment the exponent to balance it
                    e = a[`EXP] + b[`EXP] - `BIAS + normalized;
                end 

                next_state = STEP_1;
                end
        endcase
    end

    // output is concatenation of sign, exponent, and mantissa  
    assign y = {s, e, m};
endmodule

希望有帮助！

【讨论】：

我试过你的方法（截断位），但它的结果不等于python上的结果。
它们与您的 python 脚本或我的 python 脚本的结果不匹配？