将常数（2 的幂）除以整数的技巧答案

【问题标题】：Trick to divide a constant (power of two) by an integer将常数（2 的幂）除以整数的技巧
【发布时间】：2016-05-05 22:20:50
【问题描述】：

注意这是一个理论问题。我对我的实际代码的性能感到满意。我只是好奇是否有替代方案。

有没有一个技巧可以用一个整数变量值对一个常数值（它本身是 2 的整数幂）进行整数除法，而不必使用实际的除法运算？

// The fixed value of the numerator
#define SIGNAL_PULSE_COUNT 0x4000UL

// The division that could use a neat trick.
uint32_t signalToReferenceRatio(uint32_t referenceCount)
{
    // Promote the numerator to a 64 bit value, shift it left by 32 so
    // the result has an adequate number of bits of precision, and divide
    // by the numerator.
    return (uint32_t)((((uint64_t)SIGNAL_PULSE_COUNT) << 32) / referenceCount);
}

我找到了几个（很多）关于除以常数的技巧的参考资料，包括整数和浮点数。例如，What's the fastest way to divide an integer by 3? 问题有很多很好的答案，包括对其他学术和社区资料的引用。

鉴于分子是恒定的，并且它是 2 的整数幂，是否有一个巧妙的技巧可以用来代替实际的 64 位除法？某种按位运算（移位、AND、XOR、那种东西）或类似的？

我不希望任何精度损失（由于整数舍入可能超过半位）大于进行实际除法的损失，因为仪器的精度取决于此测量的精度。

_{“让编译器决定”不是答案，因为我想知道有没有技巧。}

额外的上下文信息

我正在开发一个 16 位数据、24 位指令字微控制器的驱动程序。驱动程序对外围模块进行一些魔术，以获得固定数量的信号频率脉冲的参考频率的脉冲计数。所需结果是信号脉冲与参考脉冲的比率，表示为无符号 32 位值。该函数的算法由我正在为其开发驱动程序的设备的制造商定义，并进一步处理结果以获得浮点现实世界值，但这超出了本问题的范围。

我使用的微控制器有一个数字信号处理器，它有许多我可以使用的除法运算，如果有必要我不害怕这样做。除了将汇编指令放在一起以使其工作之外，这种方法还需要克服一些小挑战，例如 DSP 用于在 BLDC 驱动程序 ISR 中执行 PID 功能，但没有什么是我无法管理的。

【问题讨论】：

即使有一个，我也不会使用 C 而是汇编。这样您就可以确定不会执行任何优化，并且可以按照您的意愿对所有内容进行编程。
没有 16 位 ARM 内核！并将优化留给您的编译器。不要做过早的优化。生成的汇编代码是什么？并且：优化除法，但随后使用浮点声音......不一致。
你希望这个技巧能做什么？它应该给你什么是正常除法没有的？
“诀窍”可能是使用编译时常量，然后确保函数是内联的。然后编译器将能够根据具体情况进行最佳优化。
一个“trick”的属性归结为1/referenceCount 并组成由SIGNAL_PULSE_COUNT 缩放的分数，OP 可以容忍一个小错误，直接power_of_2/x 太慢了。假设SIGNAL_PULSE_COUNT == 0 不是问题。给这篇文章一些时间。

标签： c optimization theory integer-arithmetic

【解决方案1】：

你不能使用聪明的数学技巧来不做除法，但如果你知道引用计数的范围，你当然仍然可以使用编程技巧：

就速度而言，没有什么比预先计算的查找表更好的了。
有快速近似平方根算法（可能已经在您的 DSP 中），您可以通过一两次 Newton-Raphson 迭代来改进近似值。如果使用浮点数进行计算对您来说足够准确，那么您可能会在速度方面击败 64 位整数除法（但不是在代码的清晰度方面）。

您提到稍后将结果转换为浮点数，完全不计算整数除法可能会有所帮助，而是使用您的浮点硬件。

【讨论】：

这是正确的。只是感觉太容易了。我的脑海中形成了一个“聪明的把戏”（这可能是错误的）的形状。关于使用模乘法逆，其模是最接近SIGNAL_PULSE_COUNT 的素数，然后使用欧拉定理的特殊情况得出一个近似值...
我的意思当然是fast approximate reciprocal square root。如果您对计算倒数的魔法感兴趣，您绝对应该检查该方法。

【解决方案2】：

我用定点算法制作了一个 Matlab 版本。

此方法假定可以有效地计算整数版本的log2(x)，这对于具有检测整数的最高有效位 1 的指令的 dsPIC30/33F 和 TI C6000 来说是正确的。

因此，此代码具有很强的 ISA 依赖性，不能用可移植/标准 C 编写，可以使用乘加、乘加和移位等指令进行改进，所以我不会尝试翻译它到 C.

nrdiv.m

function [ y ] = nrdiv( q, x, lut) 
                          % assume q>31, lut = 2^31/[1,1,2,...255]
    p2 = ceil(log2(x));   % available in TI C6000, instruction LMBD
                          % available in Microchip dsPIC30F/33F, instruction FF1L 
    if p2<8
        pre_shift=0;
    else
        pre_shift=p2-8;
    end                                  % shr = (p2-8)>0?(p2-8):0;

    xn = shr(x, pre_shift);              % xn  = x>>pre_shift;
    y  = shr(lut(xn), pre_shift);        % y   = lut[xn]>pre_shift; 
    y  = shr(y * (2^32 - y*x), 30);      % basic iteration
                                         % step up from q31 to q32
    y  = shr(y * (2^33 - y*x), (64-q));  % step up from q32 to desired q
    if q>39
        y = shr(y * (2^(1+q) - y*x), (q));  % when q>40, additional 
                                            % iteration is required, 
    end                                     % no step up is performed
end
function y = shr(x, r)
    y=floor(x./2^r);             % simulate operator >>
end

test.m

test_number = (2^22-12345);
test_q      = 48;

lut_q31 = round(2^31 ./ [1,[1:1:255]]);
display(sprintf('tested 2^%d/%d, diff=%f\n',test_q, test_number,...
                 nrdiv( 39, (2^22-5), lut_q31) - 2^39/(2^22-5)));

样本输出

tested 2^48/4181959, diff=-0.156250

参考：

Newton–Raphson division

【讨论】：

【解决方案3】：

有点晚了，但这是我的解决方案。

首先是一些假设：

问题：

X=N/D 其中 N 是常数，是 2 的幂。

所有 32 位 无符号 整数。

X 未知，但我们有一个很好的估计（以前但不再准确的解决方案）。

不需要精确的解决方案。

注意：由于整数截断，这不是一个准确的算法！

迭代解决方案是可以的（每个循环都会改进）。

除法比乘法昂贵得多：

对于 Arduino UNO 的 32 位无符号整数：

'+/-' ~0.75us

'*' ~3.5us

'/' ~36us 4 我们寻求替换基本上让我们从牛顿法开始：

Xnew=Xold-f(x)/(f`(x)

其中 f(x)=0 表示我们寻求的解决方案。

我得到解决这个问题：

Xnew=XNew*(C-X*D)/N

其中 C=2*N

第一招：

既然分子（常数）现在是除数（常数），那么这里的一个解决方案（不需要 N 是 2 的幂）是：

Xnew=XNew*(C-X*D)*A>>M

其中 C=2*N，A 和 M 是常数（寻找除以常数的技巧）。

或（使用牛顿法）：

Xnew=XNew*(C-X*D)>>M

其中 C=2>>M 其中 M 是幂。

所以我有 2 个“*”（7.0us）、一个“-”（0.75us）和一个“>>”（0.75us？）或总共 8.5us（而不是 36us），不包括其他开销。

限制：

由于数据类型是 32 位无符号，“M”不应超过 15，否则会出现溢出问题（您可能可以使用 64 位中间数据类型解决此问题）。

N>D（否则算法会崩溃！至少使用无符号整数）

显然该算法适用于有符号和浮点数据类型）

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
int main(void)
{
  unsigned long c,d,m,x;
  // x=n/d where n=1<<m
  m=15;
  c=2<<m;
  d=10;
  x=10;
  while (true)
  {
    x=x*(c-d*x)>>m;
    printf("%ld",x);
    getchar();
  }
  return(0);
}

【讨论】：

错字：“where C=2>>M where M is power”应该是“where C=2

【解决方案4】：

尝试了许多替代方案后，我最终用汇编语言进行了正常的二进制长除法。但是，该例程确实使用了一些优化，将执行时间降低到可接受的水平。

/*
 * Converts the reference frequency count for a specific signal frequency
 * to a ratio.
 *   Xs = Ns * 2^32 / Nr
 *   Where:
 *   2^32 is a constant scaling so that the maximum accuracy can be achieved.
 *   Ns is the number of signal counts (fixed at 0x4000 by hardware).
 *   Nr is the number of reference counts, passed in W1:W0.
 * @param  W1:W0    The number of reference frequency pulses.
 * @return W1:W0    The scaled ratio.
 */
    .align  2
    .global _signalToReferenceRatio
    .type   _signalToReferenceRatio, @function

    ; This is the position of the most significant bit of the fixed Ns (0x4000).
    .equ    LOG2_DIVIDEND,  14
    .equ    DIVISOR_LIMIT,  LOG2_DIVIDEND+1
    .equ    WORD_SIZE,      16

_signalToReferenceRatio:
    ; Create a dividend, MSB-aligned with the divisor, in W2:W3 and place the
    ; number of iterations required for the MSW in [W14] and the LSW in [W14+2].
    LNK     #4
    MUL.UU  W2, #0, W2
    FF1L    W1, W4
    ; If MSW is zero the argument is out of range.
    BRA     C, .returnZero
    SUBR    W4, #WORD_SIZE, W4
    ; Find the number of quotient MSW loops.
    ; This is effectively 1 + log2(dividend) - log2(divisor).
    SUBR    W4, #DIVISOR_LIMIT, [W14]
    BRA     NC, .returnZero
    ; Since the SUBR above is always non-negative and the C flag set, use this
    ; to set bit W3<W5> and the dividend in W2:W3 = 2^(16+W5) = 2^log2(divisor).
    BSW.C   W3, W4
    ; Use 16 quotient LSW loops.
    MOV     #WORD_SIZE, W4
    MOV     W4, [W14+2]

    ; Set up W4:W5 to hold the divisor and W0:W1 to hold the result.
    MOV.D   W0, W4
    MUL.UU  W0, #0, W0

.checkLoopCount:
    ; While the bit count is non-negative ...
    DEC     [W14], [W14]
    BRA     NC,  .nextWord

.alignQuotient:
    ; Shift the current quotient word up by one bit.
    SL      W0, W0
    ; Subtract divisor from the current dividend part.
    SUB     W2, W4, W6
    SUBB    W3, W5, W7
    ; Check if the dividend part was less than the divisor.
    BRA     NC, .didNotDivide
    ; It did divide, so set the LSB of the quotient.
    BSET    W0, #0
    ; Shift the remainder up by one bit, with the next zero in the LSB.
    SL      W7, W3
    BTSC    W6, #15
    BSET    W3, #0
    SL      W6, W2
    BRA     .checkLoopCount
.didNotDivide:
    ; Shift the next (zero) bit of the dividend into the LSB of the remainder.
    SL      W3, W3
    BTSC    W2, #15
    BSET    W3, #0
    SL      W2, W2
    BRA     .checkLoopCount

.nextWord:
    ; Test if there are any LSW bits left to calculate.
    MOV     [++W14], W6
    SUB     W6, #WORD_SIZE, [W14--]
    BRA     NC, .returnQ
    ; Decrement the remaining bit counter before writing it back.
    DEC     W6, [W14]
    ; Move the working part of the quotient up into the MSW of the result.
    MOV     W0, W1
    BRA     .alignQuotient

.returnQ:
    ; Return the quotient in W0:W1.
    ULNK
    RETURN

.returnZero:
    MUL.UU  W0, #0, W0
    ULNK
    RETURN
.size   _signalToReferenceRatio, .-_signalToReferenceRatio

【讨论】：