The application of the half-angle formula for the sine allows to address both of the accuracy-impacting issues mentioned in the question:
sin(incr/2) = √((1-cos(incr))/2) ⇒
sin²(incr/2) = (1-cos(incr))/2 ⇔
2·sin²(incr/2) = 1-cos(incr) ⇔
1-2·sin²(incr/2) = cos(incr)
Substituting this into the original formula results in this intermediate representation:
sin(base+incr) = (1 - 2·sin²(incr/2)) · sin(base) + sin(incr) · cos(base)
cos(base+incr) = (1 - 2·sin²(incr/2)) · cos(base) - sin(incr) · sin(base)
By a simple reordering of the terms one arrives at the desired form of the formula:
sin(base+incr) = sin(base) + (sin(incr) · cos(base) - 2·sin²(incr/2) · sin(base))
cos(base+incr) = cos(base) - (2·sin²(incr/2) · cos(base) + sin(incr) · sin(base))
As in the original formula, this only requires the one-time precomputation of two scale factors, namely 2·sin²(incr/2) and sin(incr). For small increments, both of them are small: full accuracy is retained.
There are two choices on how to apply FMA to this computation. One can either minimize the number of operations by abolishing the approach of using a single adjustment, and instead use two, hoping that the reduced rounding error of the FMA operation (one rounding for two operations) will compensate the loss in accuracy:
sin(base+incr) = fma (-2·sin²(incr/2), sin(base), fma ( sin(incr), cos(base), sin(base)))
cos(base+incr) = fma (-2·sin²(incr/2), cos(base), fma (-sin(incr), sin(base), cos(base)))
The other alternative is to apply a single FMA to the improved formula, although it is not immediately clear which of the two multiplies should be mapped to the unrounded multiplication inside the FMA:
sin(base+incr) = sin(base) + fma (sin(incr), cos(base), -2·sin²(incr/2) · sin(base))
cos(base+incr) = cos(base) - fma (sin(incr), sin(base), 2·sin²(incr/2) · cos(base))
The scaffolding below evaluates each of the computational alternative discussed above by generating many (base, incr) pairs, then iterates for each of them for a set number of steps while collecting errors of all sine and cosine values generated. From this it computes a root-mean square error for each test case, separately for sines, cosines. The largest RMS error observed across all generated test cases is reported at the end.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NAIVE (1)
#define ROBUST (2)
#define FAST (3)
#define ACCURATE (4)
#define MODE (ACCURATE)
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
double sumerrsqS, sumerrsqC, rmsS, rmsC, maxrmsS = 0, maxrmsC = 0;
double refS, refC, errS, errC;
float base, incr, s0, c0, s1, c1, tt;
int count, i;
const int N = 100;
count = 2000000;
#if MODE == NAIVE
printf ("testing: NAIVE (without FMA)\n");
#elif MODE == FAST
printf ("testing: FAST (without FMA)\n");
#elif MODE == ACCURATE
printf ("testing: ACCURATE (with FMA)\n");
#elif MODE == ROBUST
printf ("testing: ROBUST (with FMA)\n");
#else
#error unsupported MODE
#endif
do {
base = (float)(KISS * 1.21e-10);
incr = (float)(KISS * 2.43e-10 / N);
s1 = sinf (incr);
#if MODE == NAIVE
c1 = cosf (incr);
#else
tt = sinf (incr * 0.5f);
c1 = 2.0f * tt * tt;
#endif
sumerrsqS = 0;
sumerrsqC = 0;
s0 = sinf (base);
c0 = cosf (base);
i = 0;
do {
tt = s0;
#if MODE == NAIVE
s0 = c1 * tt + s1 * c0;
c0 = c1 * c0 - s1 * tt;
#elif MODE == ROBUST
s0 = ( s1 * c0 - c1 * tt) + tt;
c0 = (-s1 * tt - c1 * c0) + c0;
#elif MODE == FAST
s0 = fmaf (-c1, tt, fmaf ( s1, c0, tt));
c0 = fmaf (-c1, c0, fmaf (-s1, tt, c0));
#elif MODE == ACCURATE
s0 = tt + fmaf (s1, c0, -c1 * tt);
c0 = c0 - fmaf (s1, tt, c1 * c0);
#endif
i++;
refS = sin (fma ((double)i, (double)incr, (double)base));
refC = cos (fma ((double)i, (double)incr, (double)base));
errS = ((double)s0 - refS) / refS;
errC = ((double)c0 - refC) / refC;
sumerrsqS = fma (errS, errS, sumerrsqS);
sumerrsqC = fma (errC, errC, sumerrsqC);
} while (i < N);
rmsS = sqrt (sumerrsqS / N);
rmsC = sqrt (sumerrsqC / N);
if (rmsS > maxrmsS) maxrmsS = rmsS;
if (rmsC > maxrmsC) maxrmsC = rmsC;
} while (--count);
printf ("max rms error sin = % 16.9e\n", maxrmsS);
printf ("max rms error cos = % 16.9e\n", maxrmsC);
return EXIT_SUCCESS;
}
The output of the test scaffold shows that the fastest FMA-based alternative is superior to the naive method from the question, while the more accurate FMA-based alternative is the most accurate out of the alternatives considered:
testing: NAIVE (without FMA)
max rms error sin = 4.837386842e-006
max rms error cos = 6.884047862e-006
testing: ROBUST (without FMA)
max rms error sin = 3.330292645e-006
max rms error cos = 4.297631502e-006
testing: FAST (with FMA)
max rms error sin = 3.532624939e-006
max rms error cos = 4.763623188e-006
testing: ACCURATE (with FMA)
max rms error sin = 3.330292645e-006
max rms error cos = 4.104813533e-006
double my_sin(double x) { return x; }
is fast, portability & high precision over maybe 25% of alldouble
(the small ones) yet lousy with large values. – chux - Reinstate Monica