3
votes

In some applications, sine and cosine of multiple angles are needed, where the angles are derived by repeated addition of equal-sized increments incr to a starting value base. For performance reasons, instead of invoking the sin(), cos() standard math library functions (or possibly the non-standard sincos() function) for each generated angle, it can be very advantageous to compute sin(base) and cos(base) only once, then derive all other sines and cosines by application of angle-sum formulas:

sin(base+incr) = cos(incr) · sin(base) + sin(incr) · cos(base)
cos(base+incr) = cos(incr) · cos(base) - sin(incr) · sin(base)

This only requires the one-time precomputation of the scale factors sin(incr) and cos(incr), regardless of how many iterations are performed.

There are a couple of issues with this approach. If the increment is small, cos(incr) will be a number near unity, causing a loss of accuracy through implicit subtractive cancellation when computing in a finite precision floating-point format. Also, more rounding error than necessary is incurred because the computation is not arranged in the numerical advantageous form sin(base+incr) = sin(base) + adjust , where the computed quantity adjust is significantly smaller in magnitude than sin(base) (analogous for the cosine).

Since one usually applies tens to hundreds of iteration steps, these errors will accumulate. How can one structure the iterative computation in a manner most advantageous to maintaining high accuracy? What changes to the algorithm should be made if the fused multiply-add operation (FMA) is available, which is exposed via the standard math functions fma() and fmaf()?

3
When re-writing standard functions to take advantage of speed, often there is a loss of accuracy or loss of range or portability. Yet the importance of speed, range, portability vs. accuracy are not the same dimension and so we are left with comparing unlike qualities. IOWs, how is fast and accurate rated together? double my_sin(double x) { return x; } is fast, portability & high precision over maybe 25% of all double (the small ones) yet lousy with large values.chux - Reinstate Monica

3 Answers

4
votes

The application of the half-angle formula for the sine allows to address both of the accuracy-impacting issues mentioned in the question:

sin(incr/2) = √((1-cos(incr))/2) ⇒
sin²(incr/2) = (1-cos(incr))/2 ⇔
2·sin²(incr/2) = 1-cos(incr) ⇔
1-2·sin²(incr/2) = cos(incr)

Substituting this into the original formula results in this intermediate representation:

sin(base+incr) = (1 - 2·sin²(incr/2)) · sin(base) + sin(incr) · cos(base)
cos(base+incr) = (1 - 2·sin²(incr/2)) · cos(base) - sin(incr) · sin(base)

By a simple reordering of the terms one arrives at the desired form of the formula:

sin(base+incr) = sin(base) + (sin(incr) · cos(base) - 2·sin²(incr/2) · sin(base))
cos(base+incr) = cos(base) - (2·sin²(incr/2) · cos(base) + sin(incr) · sin(base))

As in the original formula, this only requires the one-time precomputation of two scale factors, namely 2·sin²(incr/2) and sin(incr). For small increments, both of them are small: full accuracy is retained.

There are two choices on how to apply FMA to this computation. One can either minimize the number of operations by abolishing the approach of using a single adjustment, and instead use two, hoping that the reduced rounding error of the FMA operation (one rounding for two operations) will compensate the loss in accuracy:

sin(base+incr) = fma (-2·sin²(incr/2), sin(base), fma ( sin(incr), cos(base), sin(base)))
cos(base+incr) = fma (-2·sin²(incr/2), cos(base), fma (-sin(incr), sin(base), cos(base)))

The other alternative is to apply a single FMA to the improved formula, although it is not immediately clear which of the two multiplies should be mapped to the unrounded multiplication inside the FMA:

sin(base+incr) = sin(base) + fma (sin(incr), cos(base), -2·sin²(incr/2) · sin(base))
cos(base+incr) = cos(base) - fma (sin(incr), sin(base), 2·sin²(incr/2) · cos(base))

The scaffolding below evaluates each of the computational alternative discussed above by generating many (base, incr) pairs, then iterates for each of them for a set number of steps while collecting errors of all sine and cosine values generated. From this it computes a root-mean square error for each test case, separately for sines, cosines. The largest RMS error observed across all generated test cases is reported at the end.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NAIVE    (1)
#define ROBUST   (2)
#define FAST     (3)
#define ACCURATE (4)
#define MODE (ACCURATE)

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    double sumerrsqS, sumerrsqC, rmsS, rmsC, maxrmsS = 0, maxrmsC = 0;
    double refS, refC, errS, errC;
    float base, incr, s0, c0, s1, c1, tt;
    int count, i;

    const int N = 100;  // # rotation steps per test case
    count = 2000000;    // # test cases (a pair of base and increment values)

#if MODE == NAIVE
    printf ("testing: NAIVE (without FMA)\n");
#elif MODE == FAST
    printf ("testing: FAST (without FMA)\n");
#elif MODE == ACCURATE
    printf ("testing: ACCURATE (with FMA)\n");
#elif MODE == ROBUST
    printf ("testing: ROBUST (with FMA)\n");
#else
#error unsupported MODE
#endif // MODE

    do {
        /* generate test case */
        base = (float)(KISS * 1.21e-10);      // starting angle, < 30 degrees
        incr = (float)(KISS * 2.43e-10 / N);  // increment, < 60/n degrees

        /* set up rotation parameters */
        s1 = sinf (incr);
#if MODE == NAIVE
        c1 = cosf (incr);
#else
        tt = sinf (incr * 0.5f);
        c1 = 2.0f * tt * tt;
#endif // MODE
        sumerrsqS = 0;
        sumerrsqC = 0;

        s0 = sinf (base); // initial sine
        c0 = cosf (base); // initial cosine

        /* run test case through N rotation steps */
        i = 0;
        do {         

            tt = s0; // old sine
#if MODE == NAIVE
            /* least accurate, 6 FP ops */
            s0 = c1 * tt + s1 * c0; // new sine
            c0 = c1 * c0 - s1 * tt; // new cosine
#elif MODE == ROBUST
            /* very accurate, 8 FP ops */
            s0 = ( s1 * c0 - c1 * tt) + tt; // new sine
            c0 = (-s1 * tt - c1 * c0) + c0; // new cosine
#elif MODE == FAST
            /* accurate and fast, 4 FP ops */
            s0 = fmaf (-c1, tt, fmaf ( s1, c0, tt)); // new sine
            c0 = fmaf (-c1, c0, fmaf (-s1, tt, c0)); // new cosine
#elif MODE == ACCURATE
            /* most accurate, 6 FP ops */
            s0 = tt + fmaf (s1, c0, -c1 * tt); // new sine
            c0 = c0 - fmaf (s1, tt,  c1 * c0); // new cosine
#endif // MODE
            i++;

            refS = sin (fma ((double)i, (double)incr, (double)base));
            refC = cos (fma ((double)i, (double)incr, (double)base));
            errS = ((double)s0 - refS) / refS;
            errC = ((double)c0 - refC) / refC;
            sumerrsqS = fma (errS, errS, sumerrsqS);
            sumerrsqC = fma (errC, errC, sumerrsqC);
        } while (i < N);

        rmsS = sqrt (sumerrsqS / N);
        rmsC = sqrt (sumerrsqC / N);
        if (rmsS > maxrmsS) maxrmsS = rmsS;
        if (rmsC > maxrmsC) maxrmsC = rmsC;

    } while (--count);

    printf ("max rms error sin = % 16.9e\n", maxrmsS);
    printf ("max rms error cos = % 16.9e\n", maxrmsC);

    return EXIT_SUCCESS;
}

The output of the test scaffold shows that the fastest FMA-based alternative is superior to the naive method from the question, while the more accurate FMA-based alternative is the most accurate out of the alternatives considered:

testing: NAIVE (without FMA)
max rms error sin =  4.837386842e-006
max rms error cos =  6.884047862e-006

testing: ROBUST (without FMA)
max rms error sin =  3.330292645e-006
max rms error cos =  4.297631502e-006

testing: FAST (with FMA)
max rms error sin =  3.532624939e-006
max rms error cos =  4.763623188e-006

testing: ACCURATE (with FMA)
max rms error sin =  3.330292645e-006
max rms error cos =  4.104813533e-006
1
votes

If you want to maximize accuracy over long iteration counts, you could incrementally compute an exact value with no accumulated error while you generate incremental results from the previous exact value.

For example, if you precompute sin(incr*2^x) and cos(incr*2^x) for x=6 ... 31, say, then you can use the angle-sum formulas to calculate the result for each incr=64*n one bit at a time while you output the previous 64 values.

Every 64 values you discard the incrementally generated result in favor of the exact one, so no error can accumulate over long periods.

Also, since you're only going to need 64 incremental results from any exact base, you can precompute the 64 sines and cosines required to calculate those results directly from the base instead of the previous result.

0
votes

It is possible to rearrange the equations for sin(base+incr) and cos(base+incr) in the following way:

sin(base+incr) = cos(incr) · sin(base) + sin(incr) · cos(base)
sin(base+incr) = sin(base) + (1 - cos(incr)) · -sin(base) + sin(incr) · cos(base)
sin(base+incr) = sin(base) + sin(incr) · (-1 / sin(incr) · (1 - cos(incr)) · sin(base) + cos(base))
sin(base+incr) = sin(base) + sin(incr) · (-tan(incr/2) · sin(base) + cos(base))

cos(base+incr) = cos(incr) · cos(base) - sin(incr) · sin(base)
cos(base+incr) = cos(base) - sin(incr) · (tan(incr/2) · cos(base) + sin(base))

Here we use the formula (1-cos(x)/sin(x) = tan(x/2), see here, for example. It is not immediately obvious that this should lead to more accurate results than the other approaches, but in practice it works quite well, as we will see later.

Again, this requires the one-time precomputation of two scale factors sin(incr) and tan(incr/2). In C we can write the formulas with 4 fma-s:

        s0 = fmaf ( s1, fmaf (-tt, c1, c0), tt); // new sine
        c0 = fmaf (-s1, fmaf ( c0, c1, tt), c0); // new cosine

The full updated test code is at the end of this answer. With gcc -O3 -Wall -m64 -march=skylake fastsincos.c -lm (GCC version 7.3), the results are:

testing: FAST (with FMA)
max rms error sin =  3.532624939e-06
max rms error cos =  4.763623188e-06

testing: ACCURATE (with FMA)
max rms error sin =  3.330292645e-06
max rms error cos =  4.104813533e-06

testing: FAST_ACC (with FMA)
max rms error sin =  3.330292645e-06
max rms error cos =  3.775300478e-06

The new solution FAST_ACC is indeed a bit more accurate than the others in this test.


Modified test code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NAIVE    (1)
#define ROBUST   (2)
#define FAST     (3)
#define ACCURATE (4)
#define FAST_ACC (5)
#define MODE (FAST_ACC)

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    double sumerrsqS, sumerrsqC, rmsS, rmsC, maxrmsS = 0, maxrmsC = 0;
    double refS, refC, errS, errC;
    float base, incr, s0, c0, s1, c1, tt;
    int count, i;

    const int N = 100;  // # rotation steps per test case
    count = 2000000;    // # test cases (a pair of base and increment values)

#if MODE == NAIVE
    printf ("testing: NAIVE (without FMA)\n");
#elif MODE == FAST
    printf ("testing: FAST (without FMA)\n");
#elif MODE == ACCURATE
    printf ("testing: ACCURATE (with FMA)\n");
#elif MODE == ROBUST
    printf ("testing: ROBUST (with FMA)\n");
#elif MODE == FAST_ACC
    printf ("testing: FAST_ACC (with FMA)\n");
#else
#error unsupported MODE
#endif // MODE

    do {
        /* generate test case */
        base = (float)(KISS * 1.21e-10);      // starting angle, < 30 degrees
        incr = (float)(KISS * 2.43e-10 / N);  // increment, < 60/n degrees

        /* set up rotation parameters */
        s1 = sinf (incr);
#if MODE == NAIVE
        c1 = cosf (incr);
#elif MODE == FAST_ACC
        c1 = tanf (incr * 0.5f);
#else
        tt = sinf (incr * 0.5f);
        c1 = 2.0f * tt * tt;
#endif // MODE
        sumerrsqS = 0;
        sumerrsqC = 0;

        s0 = sinf (base); // initial sine
        c0 = cosf (base); // initial cosine

        /* run test case through N rotation steps */
        i = 0;
        do {         

            tt = s0; // old sine
#if MODE == NAIVE
            /* least accurate, 6 FP ops */
            s0 = c1 * tt + s1 * c0; // new sine
            c0 = c1 * c0 - s1 * tt; // new cosine
#elif MODE == ROBUST
            /* very accurate, 8 FP ops */
            s0 = ( s1 * c0 - c1 * tt) + tt; // new sine
            c0 = (-s1 * tt - c1 * c0) + c0; // new cosine
#elif MODE == FAST
            /* accurate and fast, 4 FP ops */
            s0 = fmaf (-c1, tt, fmaf ( s1, c0, tt)); // new sine
            c0 = fmaf (-c1, c0, fmaf (-s1, tt, c0)); // new cosine
#elif MODE == ACCURATE
            /* most accurate, 6 FP ops */
            s0 = tt + fmaf (s1, c0, -c1 * tt); // new sine
            c0 = c0 - fmaf (s1, tt,  c1 * c0); // new cosine
#elif MODE == FAST_ACC
            /* fast and accurate, 4 FP ops */
            s0 = fmaf ( s1, fmaf (-tt, c1, c0), tt); // new sine
            c0 = fmaf (-s1, fmaf ( c0, c1, tt), c0); // new cosine
#endif // MODE
            i++;

            refS = sin (fma ((double)i, (double)incr, (double)base));
            refC = cos (fma ((double)i, (double)incr, (double)base));
            errS = ((double)s0 - refS) / refS;
            errC = ((double)c0 - refC) / refC;
            sumerrsqS = fma (errS, errS, sumerrsqS);
            sumerrsqC = fma (errC, errC, sumerrsqC);
        } while (i < N);

        rmsS = sqrt (sumerrsqS / N);
        rmsC = sqrt (sumerrsqC / N);
        if (rmsS > maxrmsS) maxrmsS = rmsS;
        if (rmsC > maxrmsC) maxrmsC = rmsC;

    } while (--count);

    printf ("max rms error sin = % 16.9e\n", maxrmsS);
    printf ("max rms error cos = % 16.9e\n", maxrmsC);

    return EXIT_SUCCESS;
}