Implementation of sinpi() and cospi() using standard C math library

【Question title】: Implementation of sinpi() and cospi() using standard C math library
【Posted】: 2017-08-05 04:46:37
【Question description】:

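The function sinpi(x) computes sin(πx), and the function cospi(x) computes cos(πx), where the multiplication by π is implicit inside the functions. These functions were initially introduced into the C standard math library as an extension by Sun Microsystems in the late 1980s. IEEE Std 754™-2008 specifies the equivalent functions sinPi and cosPi in section 9.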

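There are numerous computations in which sin(πx) and cos(πx) occur naturally. A very simple example is the Box-Muller transform (G. E. P. Box and Mervin E. Muller, "A Note on the Generation of Random Normal Deviates". The Annals of Mathematical Statistics, Vol. 29, No. 2, pp. 610-611), which, given two independent random variables U₁ and U₂ with uniform distribution, produces independent random variables Z₁ and Z₂ with standard normal distribution: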

Z₁ = √(-2 ln U₁) cos (2 π U₂)
Z₂ = √(-2 ln U₁) sin (2 π U₂)

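A further example is the computation of sine and cosine for arguments in degrees, as in this computation of great-circle distance using the Haversine formula: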

/* This function computes the great-circle distance of two points on earth 
   using the Haversine formula, assuming spherical shape of the planet. A 
   well-known numerical issue with the formula is reduced accuracy in the 
   case of near antipodal points.

   lat1, lon1  latitude and longitude of first point, in degrees
   lat2, lon2  latitude and longitude of second point, in degrees
               (latitudes in [-90,+90], longitudes in [-180,+180])
   radius      radius of the earth in user-defined units, e.g. 6378.2 km or 
               3963.2 miles

   returns:    distance of the two points, in the same units as radius

   Reference: http://en.wikipedia.org/wiki/Great-circle_distance
*/
double haversine (double lat1, double lon1, double lat2, double lon2, double radius)
{
    double dlat, dlon, c1, c2, d1, d2, a, c, t;

    c1 = cospi (lat1 / 180.0);
    c2 = cospi (lat2 / 180.0);
    dlat = lat2 - lat1;
    dlon = lon2 - lon1;
    d1 = sinpi (dlat / 360.0);
    d2 = sinpi (dlon / 360.0);
    t = d2 * d2 * c1 * c2;
    a = d1 * d1 + t;
    c = 2.0 * asin (fmin (1.0, sqrt (a)));
    return radius * c;
}

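For C++, the Boost library provides sin_pi and cos_pi, and some vendors offer sinpi and cospi functionality as extensions in system libraries. For example, Apple added __sinpi, __cospi and the corresponding single-precision versions __sinpif, __cospif to iOS 7 and OS X 10.9 (presentation, slide 101). But for many other platforms, there is no implementation that is readily accessible to C programs.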

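Compared with a traditional approach that uses e.g. sin (M_PI * x) and cos (M_PI * x), the use of sinpi and cospi improves accuracy because the multiplication by π happens internally, which reduces rounding error. It also offers a performance advantage due to the much simpler argument reduction.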

How can one use the standard C math library to implement sinpi() and cospi() functionality in a reasonably efficient and standard-compliant fashion?

【Question discussion】:

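For maximum accuracy and portability at the same time, it seems to me that temporarily changing the rounding mode (using e.g. fenv() or fesetround()) to truncate/round-towards-zero is necessary. That way we can use e.g. Kahan sum/compensated sum, and split high-precision coefficients into several different limited-precision factors. Every other approach seems to rely on specific hardware (like fma(), whose emulation is very slow) or on implementation details.
@NominalAnimal I did not target maximum portability, as that is not something I need. For people who want to address these issues in their own implementations, I pointed out various potential sticking points in my answer. As for FMA, it is available as a hardware instruction on recent (roughly the past 5 years) x86 and ARM processors, and of course on Power[PC] since the 1990s. If someone would like to provide an answer with code optimized for FMA-less hardware platforms, I'd be happy to upvote it (with an additional bonus if it is really good).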

Tags: c floating-point trigonometry math.h

【Solution 1】:

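For simplicity, I will focus on sincospi(), which provides both the sine and the cosine results simultaneously. sinpi and cospi can then be constructed as wrapper functions that discard the unneeded data. In many applications, the handling of floating-point flags (see fenv.h) is not required, nor do we need errno error reporting most of the time, so I will omit these.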

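The basic algorithmic structure is straightforward. Very large arguments are always even integers, so πx is a multiple of 2π and the sine and cosine values are known. Other arguments are folded into the range [-¼,+¼] while recording quadrant information. Polynomial minimax approximations are used to compute sine and cosine on the primary approximation interval. Finally, the quadrant data is used to map the preliminary results to the final results by cyclical exchange of results and sign changes.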

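Correct handling of special operands (in particular -0, infinities, and NaNs) requires the compiler to apply only optimizations that comply with IEEE-754 rules. It may not transform x*0.0 into 0.0 (which is incorrect for -0, infinities, and NaNs), nor may it optimize 0.0-x into -x, since negation is a bit-level operation according to section 5.5.1 of IEEE-754 (yielding different results for zeros and NaNs). Most compilers offer a flag that enforces the use of "safe" transformations, e.g. -fp-model=precise for the Intel C/C++ compiler.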
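
The following small sketch shows why these two transformations are not value-preserving. The three function names are hypothetical, and the behavior described holds only when the compiler follows IEEE-754 semantics (for instance, without "fast math" style options).

```c
#include <math.h>

/* x * 0.0 under IEEE-754 rules: a zero carrying the sign of x for
   finite x, and NaN when x is an infinity or a NaN. */
double signed_zero_of (double x)
{
    return x * 0.0;
}

/* 0.0 - x is a subtraction: in round-to-nearest, +0 maps to +0. */
double zero_minus (double x)
{
    return 0.0 - x;
}

/* -x is a negation: it flips the sign bit, so +0 maps to -0. */
double negate (double x)
{
    return -x;
}
```

With these definitions, signed_zero_of (-3.0) is -0, signed_zero_of (INFINITY) is NaN, and zero_minus (0.0) and negate (0.0) differ in the sign of the zero they return, which can be observed with signbit().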

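One additional caveat applies to the use of the nearbyint function during argument reduction. Like rint, this function is specified to round according to the current rounding mode. When fenv.h is not used, the rounding mode defaults to round "to nearest, ties to even". When it is used, there is a risk that a directed rounding mode is in effect. This could be worked around by using round, which always rounds "to nearest, ties away from zero" independent of the current rounding mode. However, this function tends to be slower, since on most processor architectures it is not supported by an equivalent machine instruction.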

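A note on performance: the C99 code below relies heavily on fma(), which implements a fused multiply-add operation. On most modern hardware architectures, this is directly supported by a corresponding hardware instruction. Where this is not the case, the code may be significantly slower, because FMA emulation is generally slow.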
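
C99 lets an implementation advertise a fast fma() through the optional FP_FAST_FMA macro in math.h. The hypothetical helper below sketches how code could test for it at compile time. Note that this is an illustration of the trade-off, not part of the answer's code: the fallback branch rounds twice, so its results can differ from fma(), and the error bounds quoted for the fma-based code should not be assumed to carry over to it.

```c
#include <math.h>

/* Computes a*b + c. Uses fma() (a single rounding) when the
   implementation advertises a fast fma via FP_FAST_FMA; otherwise
   falls back to a separate multiply and add (two roundings, so the
   result can differ in the last bit or more). */
double mul_add (double a, double b, double c)
{
#ifdef FP_FAST_FMA
    return fma (a, b, c);
#else
    return a * b + c;
#endif
}
```

Code whose accuracy depends on the single rounding of fma(), as the code in this answer does, would keep calling fma() directly and accept the slowdown on platforms without hardware support.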

#include <math.h>
#include <stdint.h>

/* Writes sine result sin(πa) to the location pointed to by sp
   Writes cosine result cos(πa) to the location pointed to by cp

   In extensive testing, no errors > 0.97 ulp were found in either the sine
   or cosine results, suggesting the results returned are faithfully rounded.
*/
void my_sincospi (double a, double *sp, double *cp)
{
    double c, r, s, t, az;
    int64_t i;

    az = a * 0.0; // must be evaluated with IEEE-754 semantics
    /* for |a| >= 2**53, cospi(a) = 1.0, but cospi(Inf) = NaN */
    a = (fabs (a) < 9.0071992547409920e+15) ? a : az;  // 0x1.0p53
    /* reduce argument to primary approximation interval [-0.25, 0.25] */
    r = nearbyint (a + a); // must use IEEE-754 "to nearest" rounding
    i = (int64_t)r;
    t = fma (-0.5, r, a);
    /* compute core approximations */
    s = t * t;
    /* Approximate cos(pi*x) for x in [-0.25,0.25] */
    r =            -1.0369917389758117e-4;
    r = fma (r, s,  1.9294935641298806e-3);
    r = fma (r, s, -2.5806887942825395e-2);
    r = fma (r, s,  2.3533063028328211e-1);
    r = fma (r, s, -1.3352627688538006e+0);
    r = fma (r, s,  4.0587121264167623e+0);
    r = fma (r, s, -4.9348022005446790e+0);
    c = fma (r, s,  1.0000000000000000e+0);
    /* Approximate sin(pi*x) for x in [-0.25,0.25] */
    r =             4.6151442520157035e-4;
    r = fma (r, s, -7.3700183130883555e-3);
    r = fma (r, s,  8.2145868949323936e-2);
    r = fma (r, s, -5.9926452893214921e-1);
    r = fma (r, s,  2.5501640398732688e+0);
    r = fma (r, s, -5.1677127800499516e+0);
    s = s * t;
    r = r * s;
    s = fma (t, 3.1415926535897931e+0, r);
    /* map results according to quadrant */
    if (i & 2) {
        s = 0.0 - s; // must be evaluated with IEEE-754 semantics
        c = 0.0 - c; // must be evaluated with IEEE-754 semantics
    }
    if (i & 1) { 
        t = 0.0 - s; // must be evaluated with IEEE-754 semantics
        s = c;
        c = t;
    }
    /* IEEE-754: sinPi(+n) is +0 and sinPi(-n) is -0 for positive integers n */
    if (a == floor (a)) s = az;
    *sp = s;
    *cp = c;
}

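The single-precision version differs essentially only in the core approximations. Exhaustive testing allows the error bounds to be determined precisely.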

#include <math.h>
#include <stdint.h>

/* Writes sine result sin(πa) to the location pointed to by sp
   Writes cosine result cos(πa) to the location pointed to by cp

   In exhaustive testing, the maximum error in sine results was 0.96677 ulp,
   the maximum error in cosine results was 0.96563 ulp, meaning results are
   faithfully rounded.
*/
void my_sincospif (float a, float *sp, float *cp)
{
    float az, t, c, r, s;
    int32_t i;

    az = a * 0.0f; // must be evaluated with IEEE-754 semantics
    /* for |a| >= 2**24, cospi(a) = 1.0f, but cospi(Inf) = NaN */
    a = (fabsf (a) < 0x1.0p24f) ? a : az;
    r = nearbyintf (a + a); // must use IEEE-754 "to nearest" rounding
    i = (int32_t)r;
    t = fmaf (-0.5f, r, a);
    /* compute core approximations */
    s = t * t;
    /* Approximate cos(pi*x) for x in [-0.25,0.25] */
    r =              0x1.d9e000p-3f;
    r = fmaf (r, s, -0x1.55c400p+0f);
    r = fmaf (r, s,  0x1.03c1cep+2f);
    r = fmaf (r, s, -0x1.3bd3ccp+2f);
    c = fmaf (r, s,  0x1.000000p+0f);
    /* Approximate sin(pi*x) for x in [-0.25,0.25] */
    r =             -0x1.310000p-1f;
    r = fmaf (r, s,  0x1.46737ep+1f);
    r = fmaf (r, s, -0x1.4abbfep+2f);
    r = (t * s) * r;
    s = fmaf (t, 0x1.921fb6p+1f, r);
    if (i & 2) {
        s = 0.0f - s; // must be evaluated with IEEE-754 semantics
        c = 0.0f - c; // must be evaluated with IEEE-754 semantics
    }
    if (i & 1) {
        t = 0.0f - s; // must be evaluated with IEEE-754 semantics
        s = c;
        c = t;
    }
    /* IEEE-754: sinPi(+n) is +0 and sinPi(-n) is -0 for positive integers n */
    if (a == floorf (a)) s = az;
    *sp = s;
    *cp = c;
}

【Discussion】:

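To the extent that you explicitly rely on IEEE 754 semantics, how do you address the fact that the C standard does not require an implementation's floating-point representation or arithmetic to conform to IEEE 754 (at all)?
@JohnBollinger I don't. If a toolchain provides sufficient control over floating-point formats and transformations under IEEE-754 rules, this code works properly with regard to IEEE-754 (to the best of my ability to test that). Conversely, if a toolchain does not generally conform to IEEE-754, there should be no expectation (nor do I see a need) that this code conforms to any of the requirements of IEEE-754.
Out of curiosity, why the use of hexadecimal floats and decimal doubles?
In the last step of computing the sine, instead of computing s = s * t; r = r * s; s = fma (t, π, r); (which amounts to computing s = π*t + r*t^3), the multiplication by t can be factored out, so that one fma and one further multiplication suffice: s = fma (r, s, 3.1415926535897931e+0); s = s * t.
@MatíasGiovannini That reordering increases the maximum ulp error (anecdotally to ~1.5 ulp), so the implementation would no longer be faithfully rounded (which was my design goal). This may be acceptable in some contexts.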