【问题标题】：Range Reduction Poor Precision For Single Precision Floating Point单精度浮点的范围缩小精度差
【发布时间】：2012-03-14 11:25:39
【问题描述】：

我正在尝试将范围缩小作为实现正弦函数的第一步。

我按照论文"ARGUMENT REDUCTION FOR HUGE ARGUMENTS" by K.C. NG中描述的方法进行

当使用从 0 到 20000 的 x 的输入范围时，我得到的错误高达 0.002339146。我的错误显然不应该那么大，我不确定如何减少它。我注意到误差幅度与余弦/正弦的输入 theta 幅度相关。

我能够获得论文提到的 nearpi.c 代码，但我不确定如何将该代码用于单精度浮点。如果有人有兴趣，可以在此链接找到 nearpi.c 文件：nearpi.c

这是我的 MATLAB 代码：

x = 0:0.1:20000;

% Perform range reduction
% Store constant 2/pi
twooverpi = single(2/pi);

% Compute y
y = (x.*twooverpi);

% Compute k (round to nearest integer
k = round(y);

% Solve for f
f = single(y-k);

% Solve for r
r = single(f*single(pi/2));

% Find last two bits of k
n = bitand(fi(k,1,32,0),fi(3,1,32,0));
n = single(n);

% Preallocate for speed
z(length(x)) = 0;
for i = 1:length(x)

    switch(n(i))
        case 0
            z(i)=sin(r(i));
        case 1
            z(i) = single(cos(r(i)));
        case 2
            z(i) = -sin(r(i));
        case 3
            z(i) = single(-cos(r(i)));
        otherwise
    end

end

maxerror = max(abs(single(z - single(sin(single(x))))))
minerror = min(abs(single(z - single(sin(single(x))))))

我已经编辑了程序 nearpi.c 以便它编译。但是我不确定如何解释输出。该文件还需要一个输入，我必须手动输入，我也不确定输入的重要性。

这是工作的 nearpi.c：

/*
 ============================================================================
 Name        : nearpi.c
 Author      : 
 Version     :
 Copyright   : Your copyright notice
 Description : Hello World in C, Ansi-style
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>


/*
 * Global macro definitions.
 */

# define hex( double )  *(1 + ((long *) &double)), *((long *) &double)
# define sgn(a)         (a >= 0 ? 1 : -1)
# define MAX_k          2500
# define D              56
# define MAX_EXP        127
# define THRESHOLD      2.22e-16

/*
 *  Global Variables
 */

int     CFlength,               /* length of CF including terminator */
        binade;
double  e,
        f;                      /* [e,f] range of D-bit unsigned int of f;
                                   form 1X...X */

// Function Prototypes
int dbleCF (double i[], double j[]);
void input (double i[]);
void nearPiOver2 (double i[]);


/*
 *  This is the start of the main program.
 */

int main (void)
{
    int     k;                  /* subscript variable */
    double  i[MAX_k],
            j[MAX_k];           /* i and j are continued fractions
                                   (coeffs) */


   // fp = fopen("/src/cfpi.txt", "r");


/*
 *  Compute global variables e and f, where
 *
 *      e = 2 ^ (D-1), i.e. the D bit number 10...0
 *  and
 *      f = 2 ^ D - 1, i.e. the D bit number 11...1  .
 */

    e = 1;
    for (k = 2; k <= D; k = k + 1)
        e = 2 * e;
    f = 2 * e - 1;

 /*
  *  Compute the continued fraction for  (2/e)/(pi/2)  , i.e.
  *  q's starting value for the first binade, given the continued
  *  fraction for  pi  as input; set the global variable CFlength
  *  to the length of the resulting continued fraction (including
  *  its negative valued terminator).  One should use as many
  *  partial coefficients of  pi  as necessary to resolve numbers
  *  of the width of the underflow plus the overflow threshold.
  *  A rule of thumb is 0.97 partial coefficients are generated
  *  for every decimal digit of  pi .
  *
  *  Note: for radix B machines, subroutine  input  should compute
  *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
  */

    input (i);

/*
 *  Begin main loop over all binades:
 *  For each binade, find the nearest multiples of pi/2 in that binade.
 *
 *  [ Note: for hexadecimal machines ( B = 16 ), the rest of the main
 *  program simplifies(!) to
 *
 *                      B_ade = 1;
 *                      while (B_ade < MAX_EXP)
 *                      {
 *                          dbleCF (i, j);
 *                          dbleCF (j, i);
 *                          dbleCF (i, j);
 *                          CFlength = dbleCF (j, i);
 *                          B_ade = B_ade + 1;
 *                      }
 *                  }
 *
 *  because the alternation of source & destination are no longer necessary. ]
 */

    binade = 1;
    while (binade < MAX_EXP)
    {

/*
 *  For the current (odd) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (i);

/*
 *  Double the continued fraction to get to the next (even) binade.
 *  To save copying arrays, i and j will alternate as the source
 *  and destination for the continued fractions.
 */

        CFlength = dbleCF (i, j);
        binade = binade + 1;

/*
 *  Check for main loop termination again because of the
 *  alternation.
 */

        if (binade >= MAX_EXP)
            break;

/*
 *  For the current (even) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (j);

/*
 *  Double the continued fraction to get to the next (odd) binade.
 */

        CFlength = dbleCF (j, i);
        binade = binade + 1;
    }

    return 0;
}                               /* end of Main Program */

/*
 *  Subroutine  DbleCF  doubles a continued fraction whose partial
 *  coefficients are i[] into a continued fraction j[], where both
 *  arrays are of a type sufficient to do D-bit integer arithmetic.
 *
 *  In my case ( D = 56 ) , I am forced to treat integers as double
 *  precision reals because my machine does not have integers of
 *  sufficient width to handle D-bit integer arithmetic.
 *
 *  Adapted from a Basic program written by W. Kahan.
 *
 *  Algorithm based on Hurwitz's method of doubling continued
 *  fractions (see Knuth Vol. 3, p.360).
 *
 *  A negative value terminates the last partial quotient.
 *
 *  Note:  for the non-C programmers, the statement  break
 *  exits a loop and the statement  continue  skips to the next
 *  case in the same loop.
 *
 *  The call  modf ( l / 2, &l0 )  assigns the integer portion of
 *  half of L to L0.
 */

int dbleCF (double i[], double j[])
{
      double k,
                    l,
                    l0,
                    j0;
      int    n,
                    m;
    n = 1;
    m = 0;
    j0 = i[0] + i[0];
    l = i[n];
    while (1)
    {
        if (l < 0)
        {
            j[m] = j0;
            break;
        };
        modf (l / 2, &l0);
        l = l - l0 - l0;
        k = i[n + 1];
        if (l0 > 0)
        {
            j[m] = j0;
            j[m + 1] = l0;
            j0 = 0;
            m = m + 2;
        };
        if (l == 0) {
/*
 *  Even case.
 */
            if (k < 0)
            {
                m = m - 1;
                break;
            }
            else
            {
                j0 = j0 + k + k;
                n = n + 2;
                l = i[n];
                continue;
            };
        }
/*
 *  Odd case.
 */
        if (k < 0)
        {
            j[m] = j0 + 2;
            break;
        };
        if (k == 0)
        {
            n = n + 2;
            l = l + i[n];
            continue;
        };
        j[m] = j0 + 1;
        m = m + 1;
        j0 = 1;
        l = k - 1;
        n = n + 1;
        continue;
    };
    m = m + 1;
    j[m] = -99999;
    return (m);
}

/*
 *  Subroutine  input  computes the continued fraction for
 *  (2/e) / (pi/2) , where  e = 2 ^ (D-1) , given  pi 's
 *  continued fraction as input.  That is, double the continued
 *  fraction of  pi   D-3  times and place a zero at the front.
 *
 *  One should use as many partial coefficients of  pi  as
 *  necessary to resolve numbers of the width of the underflow
 *  plus the overflow threshold.  A rule of thumb is  0.97
 *  partial coefficients are generated for every decimal digit
 *  of  pi .  The last coefficient of  pi  is terminated by a
 *  negative number.
 *
 *  I'll be happy to supply anyone with the partial coefficients
 *  of  pi .  My ARPA address is  mcdonald@ucbdali.BERKELEY.ARPA .
 *
 *  I computed the partial coefficients of  pi  using a method of
 *  Bill Gosper's.  I need only compute with integers, albeit
 *  large ones.  After writing the program in  bc  and  Vaxima  ,
 *  Prof. Fateman suggested  FranzLisp .  To my surprise, FranzLisp
 *  ran the fastest!  the reason?   FranzLisp's  Bignum  package is
 *  hand coded in assembler.  Also,  FranzLisp  can be compiled.
 *
 *
 *  Note: for radix B machines, subroutine  input  should compute
 *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
 *  In the case of hexadecimal ( B = 16 ), this is done by repeated
 *  doubling the appropriate number of times.
 */

void input (double i[])
{
    int     k;
    double  j[MAX_k];

/*
 *  Read in the partial coefficients of  pi  from a precalculated file
 *  until a negative value is encountered.
 */

    k = -1;
    do
    {
        k = k + 1;
        scanf ("%lE", &i[k]);
        printf("hello\n");
        printf("%d", k);
    } while (i[k] >= 0);

/*
 *  Double the continued fraction for  pi  D-3  times using
 *  i  and  j  alternately as source and destination.  On my
 *  machine  D = 56  so  D-3  is odd; hence the following code:
 *
 *  Double twice  (D-3)/2  times,
 */
    for (k = 1; k <= (D - 3) / 2; k = k + 1)
    {
        dbleCF (i, j);
        dbleCF (j, i);
    };
/*
 *  then double once more.
 */
    dbleCF (i, j);

/*
 *  Now append a zero on the front (reciprocate the continued
 *  fraction) and the return the coefficients in  i .
 */

    i[0] = 0;
    k = -1;
    do
    {
        k = k + 1;
        i[k + 1] = j[k];
    } while (j[k] >= 0);

/*
 *  Return the length of the continued fraction, including its
 *  terminator and initial zero, in the global variable CFlength.
 */

    CFlength = k;
}

/*
 *  Given a continued fraction's coefficients in an array  i ,
 *  subroutine  nearPiOver2  finds all machine representable
 *  values near a integer multiple of  pi/2  in the current binade.
 */

void nearPiOver2 (double i[])
{
    int     k,                  /* subscript for recurrences    (see
                                   handout) */
            K;                  /* like  k , but used during cancel. elim.
                                   */
    double  p[MAX_k],           /* product of the q's           (see
                                   handout) */
            q[MAX_k],           /* successive tail evals of CF  (see
                                   handout) */
            j[MAX_k],           /* like convergent numerators   (see
                                   handout) */
            tmp,                /* temporary used during cancellation
                                   elim. */
            mk0,                /* m[k - 1]                     (see
                                   handout) */
            mk,                 /* m[k] is one of the few ints  (see
                                   handout) */
            mkAbs,              /* absolute value of m sub k
                                */
            mK0,                /* like  mk0 , but used during cancel.
                                   elim. */
            mK,                 /* like  mk  , but used during cancel.
                                   elim. */
            z,                  /* the object of our quest (the argument)
                                */
            m0,                 /* the mantissa of z as a D-bit integer
                                */
            x,                  /* the reduced argument         (see
                                   handout) */
            ldexp (),           /* sys routine to multiply by a power of
                                   two  */
            fabs (),            /* sys routine to compute FP absolute
                                   value   */
            floor (),           /* sys routine to compute greatest int <=
                                   value   */
            ceil ();            /* sys routine to compute least int >=
                                   value   */

 /*
  *  Compute the q's by evaluating the continued fraction from
  *  bottom up.
  *
  *  Start evaluation with a big number in the terminator position.
  */

    q[CFlength] = 1.0 + 30;

    for (k = CFlength - 1; k >= 0; k = k - 1)
        q[k] = i[k] + 1 / q[k + 1];

/*
 *  Let  THRESHOLD  be the biggest  | x |  that we are interesed in
 *  seeing.
 *
 *  Compute the p's and j's by the recurrences from the top down.
 *
 *  Stop when
 *
 *        1                          1
 *      -----   >=  THRESHOLD  >   ------    .
 *      2 |j |                     2 |j  |
 *          k                          k+1
 */

    p[0] = 1;
    j[0] = 0;
    j[1] = 1;
    k = 0;
    do
    {
        p[k + 1] = -q[k + 1] * p[k];
        if (k > 0)
            j[1 + k] = j[k - 1] - i[k] * j[k];
        k = k + 1;
    } while (1 / (2 * fabs (j[k])) >= THRESHOLD);

/*
 *  Then  mk  runs through the integers between
 *
 *                  k        +                   k        +
 *              (-1)  e / p  -  1/2     &    (-1)  f / p  -  1/2  .
 *                         k                            k
 */

    for (mkAbs = floor (e / fabs (p[k]));
            mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1)
    {

        mk = mkAbs * sgn (p[k]);

/*
 *  For each  mk ,  mk0  runs through integers between
 *
 *                    +
 *              m  q  -  p  THRESHOLD  .
 *               k  k     k
 */

        for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD);
                mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD);
                mk0 = mk0 + 1)
        {

/*
 *  For each pair  { mk ,  mk0 } , check that
 *
 *                             k
 *              m       =  (-1)  ( j   m  - j  m   )
 *               0                  k-1 k    k  k-1
 */
            m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0);

/*
 *  lies between  e  and  f .
 */
            if (e <= fabs (m0) && fabs (m0) <= f)
            {

/*
 *  If so, then we have found an
 *
 *                              k
 *              x       =  ((-1)  m  / p  - m ) / j
 *                                 0    k    k     k
 *
 *                      =  ( m  q  - m   ) / p  .
 *                            k  k    k-1     k
 *
 *  But this later formula can suffer cancellation.  Therefore,
 *  run the recurrence for the  mk 's  to get  mK  with minimal
 *   | mK | + | mK0 |  in the hope  mK  is  0  .
 */
                K = k;
                mK = mk;
                mK0 = mk0;
                while (fabs (mK) > 0)
                {
                    p[K + 1] = -q[K + 1] * p[K];
                    tmp = mK0 - i[K] * mK;
                    if (fabs (tmp) > fabs (mK0))
                        break;
                    mK0 = mK;
                    mK = tmp;
                    K = K + 1;
                };

/*
 *  Then
 *              x       =  ( m  q  - m   ) / p
 *                            K  K    K-1     K
 *
 *  as accurately as one could hope.
 */
                x = (mK * q[K] - mK0) / p[K];

/*
 *  To return  z  and  m0  as positive numbers,
 *   x  must take the sign of  m0  .
 */
                x = x * sgn (m0);
                m0 = fabs (m0);

/*d
 *  Set  z = m0 * 2 ^ (binade+1-D) .
 */
                z = ldexp (m0, binade + 1 - D);

/*
 *  Print  z (hex),  z (dec),  m0 (dec),  binade+1-D,  x (hex), x (dec).
 */

                printf ("%08lx %08lx    Z=%22.16E    M=%17.17G    L+1-%d=%3d    %08lx %08lx    x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x);

            }
        }
    }
}

【问题讨论】：

IIRC 对三角函数的精确范围缩减要求 pi 和其余部分都具有很高的准确性；数学库中通常使用数百位。
+1 -- 有趣的论文（刚好有时间略读，以后要仔细阅读）
@starbox：我以前读过，是的。我不熟悉用matlab做高精度算术，所以我不会建议你怎么做。
输入包含 pi 的连分数，这是计算具有上述精度的所有内容所必需的，这在第 2.3 节中提到并作为参考文献引用。 [6]。因此，您可能希望获得该出版物并编写一个小程序来生成 nearpi.c 的输入文件。也许wolfram 在这里有帮助。
如果我使用给定系列的数字作为输入，nearpi.c 会按预期工作。 input file，-1 表示连分数输入结束。用法：cat confrac.txt | ./a.out,here：可以得到连分数系列的前20000个。

标签： c++ c algorithm matlab floating-point

【解决方案1】：

理论

首先让我们注意使用单精度算术的区别。

[公式 8] f 的最小值可以更大。由于双精度数是单精度数的超集，最接近 2/pi 的倍数的 single 只能比 ~2.98e-19 更远，因此固定算术中前导零的数量f 的表示形式必须最多有 61 个前导零（但可能会更少）。表示这个数量fdigits。
[9 之前的方程式] 因此，y 必须精确到 fdigits + 24（单精度中的非零有效位）+ 7（额外保护位）= @，而不是 121 位987654333@ + 31，最多 92。
[公式 9] "因此，x 的指数宽度加上2/pi 必须包含 127（single 的最大指数）+ 31 + fdigits，或 158 + @ 987654338@，最多 219 位。
[第 2.5 小节]A 的大小由二进制点之前x 中的零个数决定（并且不受移动到single 的影响），而C 的大小是确定的通过 9 之前的公式。
- 对于大的x (x>=2^24)，x 看起来像这样：[24 位，M 个零]。将它乘以A，其大小是2/pi 的第一个M 位，将得到一个整数（x 的零只会将所有内容转换为整数）。
- 选择C 从2/pi 的M+d 位开始将导致产品x*C 的大小最多为d-24。在双精度中，d 选择为 174（而不是 24，我们有 53），因此产品的大小最多为 121。在single 中，选择d 就足够了，这样@ 987654358@，或更准确地说，d-24 <= fdigits+31。也就是说，d 可以选择为fdigits+55，也可以最多选择 116。
- 因此，B 的大小最多应为 116 位。

因此我们面临两个问题：

计算fdigits。这涉及从链接的论文中阅读参考文献 6 并理解它。可能没那么容易。 :) 据我所知，这是唯一使用nearpi.c 的地方。
计算B，2/pi的相关位。由于M 以 127 为界，我们可以离线计算2/pi 的前 127+116 位并将它们存储在一个数组中。见Wikipedia。
计算y=x*B。这涉及将 x 乘以 116 位数字。这就是使用第 3 节的地方。块的大小选择为 24，因为 2*24 + 2（将两个 24 位数字相乘，并添加 3 个这样的数字）小于double 的精度，53（并且因为 24 除以 96）。出于类似的原因，我们可以将大小为 11 位的块用于 single 算法。

注意 - B 的技巧只适用于指数为正的数字 (x>=2^24)。

总结一下 - 首先，您必须使用double 精度解决问题。你的Matlab 代码在double 精度下也不起作用（尝试删除single 并计算sin(2^53)，因为你的twooverpi 只有53 个有效位，而不是175（无论如何，你不能直接相乘） Matlab 中如此精确的数字）。其次，该方案应该适用于single，再次，关键问题是足够精确地表示2/pi，并支持高精度数字的乘法。最后，当一切正常时，您可以尝试找出更好的fdigits 来减少您必须存储和相乘的位数。

希望我没有完全偏离 - 欢迎使用 cmets 和矛盾。

示例

例如，让我们计算sin(x)，其中x = single(2^24-1)，在有效位之后没有零（M = 0）。这简化了查找B，因为B 由2/pi 的前116 位组成。由于x 的精度为 24 位，B 的精度为 116 位，因此乘积

y = x * B

将根据需要具有 92 位精度。

链接论文中的第 3 节描述了如何以足够的精度执行此产品；在我们的例子中，相同的算法可以用于大小为 11 的块来计算 y。作为苦差事，我希望原谅我没有明确地这样做，而是依靠Matlab 的符号数学工具箱。这个工具箱为我们提供了vpa 函数，它允许我们以十进制数字指定数字的精度。所以，

vpa('2/pi', ceil(116*log10(2)))

将产生至少 116 位精度的 2/pi 近似值。因为vpa 只接受整数作为它的精度参数，我们通常不能精确地指定一个数字的二进制精度，所以我们使用次优。

以下代码根据论文计算sin(x)，精度为single：

x = single(2^24-1);
y = x *  vpa('2/pi', ceil(116*log10(2)));    % Precision = 103.075
k = round(y);
f = single(y - k);
r = f * single(pi) / 2;
switch mod(k, 4)
    case 0 
        s = sin(r);
    case 1
        s = cos(r);
    case 2
        s = -sin(r);
    case 3
        s = -cos(r);
end
sin(x) - s                                   % Expected value: exactly zero.

（y 的精度是使用Mathematica 获得的，结果证明这是一个比Matlab 更好的数值工具:)）

在`libm`

这个问题的另一个答案（此后已被删除）将我带到libm 中的一个实现，虽然它适用于双精度数字，但非常彻底地遵循了链接的论文。

请参阅文件 s_sin.c 以获取包装器（链接文件中的表 2 显示为文件末尾的 switch 语句），e_rem_pio2.c 用于参数缩减代码（特别感兴趣的是包含2/pi 的前 396 个十六进制数字，从第 69 行开始）。

【讨论】：

我知道有问题需要解决。但是我仍然没有看到基于您所写内容的解决方案。你能把你留下的空白填上吗？感谢您的宝贵时间。
我的输入可能总是在 +/- 20000 范围内（也包括小输入），但我的输入 x 总是远小于 2^24。
该代码示例应该适用于幅度小于 2^24 的所有数字。你只需要更少的2/pi 数字。到底有多少 - 请查看链接论文中的图 2。
我现在仍在关注如何计算我需要多少个 2/pi。我知道我的输入范围是多少（+/-20000）。如果我有 2^24 左右的输入，你的例子是完美的，但是给定我的输入范围，我需要 2/pi 的位数是多少？
我相信答案已经在这里解决了：springerlink.com/content/h886863382gg4u83/fulltext.pdf（如果您无法查看此链接，请告诉我）

理论

示例

在libm

在`libm`