I tried millianw's answer and it gave me a 4.5x speedup, so that's awesome.
However, the original article that millianw links to computes a sine, not a cosine, and it does it somewhat differently. (It looks simpler.)
Predictably, after 15 years the URL of that article (http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648) gives a 404 today, so I fetched it via archive.org and I am adding it here for posterity.
Unfortunately, even though the article contains several images, only the first 2 were stored by archive.org. Also, the profile page of the author (http://forum.devmaster.net/users/Nick) was not stored, so I guess we will never know who Nick is.
==================================================
Fast and accurate sine/cosine
Nick
Apr '06
Hi all,
In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is implementing dynamic subdivision of round surfaces, comparable to those in Quake 3. Another is implementing wave motion when vertex shader 2.0 support is not available.
The standard C sinf() and cosf() functions are terribly slow and offer far more precision than we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The best-known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for the sine becomes:
x - 1/6 x^3 + 1/120 x^5 - 1/5040 x^7 + ...
When we plot this we get: taylor.gif.

The green line is the real sine; the red line is the first four terms of the Taylor series. This seems like an acceptable approximation, but let's have a closer look: taylor_zoom.gif.

It behaves very nicely up to pi/2, but after that it quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this for wave simulation would result in jerky motion, which is unacceptable.
We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The Taylor series just can't give us the precision and performance we're looking for.
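For comparison, this is how that 4-term version might be written in C (this snippet is mine, not from the article); the comments count the operations:

/* 4-term Maclaurin series for sin(x): 7 multiplications, 3 additions */
float sine_taylor4(float x)
{
    float x2 = x * x;    /* mul 1 */
    float x3 = x2 * x;   /* mul 2 */
    float x5 = x3 * x2;  /* mul 3 */
    float x7 = x5 * x2;  /* mul 4 */
    return x - x3 * (1.0f / 6.0f)      /* mul 5, add 1 */
             + x5 * (1.0f / 120.0f)    /* mul 6, add 2 */
             - x7 * (1.0f / 5040.0f);  /* mul 7, add 3 */
}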
We did however note that we need sine(pi) = 0. And there's another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let's try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x^2. So this gives us three degrees of freedom. Obvious choices are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations:
A + B 0 + C 0^2 = 0
A + B pi/2 + C (pi/2)^2 = 1
A + B pi + C pi^2 = 0
Which has as solution A = 0, B = 4/pi, C = -4/pi^2. (The first equation gives A = 0; the third then gives B = -C pi; substituting into the second yields -C pi^2/4 = 1, so C = -4/pi^2 and B = 4/pi.) So our parabola approximation becomes 4/pi x - 4/pi^2 x^2. Plotting this we get: parabola.gif. This looks worse than the 4-term Taylor series, right? Wrong! The maximum absolute error is 0.056. Furthermore, this approximation will give us smooth wave motion, and it can be calculated in only 3 multiplications and 1 addition!
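As a quick sanity check of that error figure (this little test program is mine, not part of the article):

#include <math.h>
#include <stdio.h>

/* Scan [0, pi] and report the largest deviation of the parabola
   4/pi x - 4/pi^2 x^2 from the real sine. */
int main(void)
{
    const double pi = 3.14159265358979323846;
    double max_err = 0.0;
    for (int i = 0; i <= 100000; i++) {
        double x = pi * i / 100000.0;
        double approx = (4.0 / pi) * x - (4.0 / (pi * pi)) * x * x;
        double err = fabs(approx - sin(x));
        if (err > max_err) max_err = err;
    }
    printf("max abs error: %f\n", max_err);  /* ~0.056 */
    return 0;
}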
Unfortunately it's not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It's quite obvious we want at least a full period. But it's also clear that it's just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi^2 x^2. So the straightforward (pseudo-C) solution is:
if (x > 0)
{
    y = 4/pi x - 4/pi^2 x^2;
}
else
{
    y = 4/pi x + 4/pi^2 x^2;
}
Adding a branch is not a good idea though; it makes the code significantly slower. But look at how similar the two parts really are: a subtraction becomes an addition depending on the sign of x. In a naive first attempt to eliminate the branch we can 'extract' the sign of x using x / abs(x). The division is very expensive but look at the resulting formula: 4/pi x - x / abs(x) 4/pi^2 x^2. By inverting the division (x^2 abs(x) / x is just x abs(x)) we can simplify this to a very nice and clean 4/pi x - 4/pi^2 x abs(x). So for just one extra operation we get both halves of our sine approximation! Here's the graph of this formula confirming the result: abs.gif.
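In real C that one-liner would use fabsf() from <math.h> (a sketch of the formula above; the complete function appears near the end of the article):

/* branch-free parabola approximation of sin(x) on [-pi, pi] */
float y = (4.0f / pi) * x - (4.0f / (pi * pi)) * x * fabsf(x);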
Now let's look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi. So the code becomes:
x += pi/2;
if (x > pi)          // Original x > pi/2
{
    x -= 2 * pi;     // Wrap: cos(x) = cos(x - 2 pi)
}
y = sine(x);
Yet another branch. To eliminate it, we can use binary logic, like this:
x -= (x > pi) & (2 * pi);
Note that this isn't valid C code at all. But it should clarify how this can work. When x > pi is false, the & operation zeroes the right-hand part so the subtraction does nothing, which is perfectly equivalent. I'll leave it as an exercise to the reader to create working C code for this (or just read on). Clearly the cosine requires a few more operations than the sine, but there doesn't seem to be any other way and it's still extremely fast.
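Here is one way to turn that idea into working C (a sketch of my own, not the article's code: it uses the fact that the comparison evaluates to 0 or 1 and multiplies instead of AND-ing):

/* shift, then wrap x back into [-pi, pi] without writing a branch:
   (x > pi) is 0 or 1 in C, so we subtract either 0 or 2*pi */
x += pi / 2;
x -= (float)(x > pi) * (2.0f * pi);
y = sine(x);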
Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn't there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that's just not good enough.
Looking at the graphs you'll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to 'scale it down' without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it's always lower than the real sine. So we can use a weighted average of the two to get a better approximation:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
With Q + P = 1. You can use exact minimization of either the absolute or relative error, but I'll save you the math. The optimal weights are Q = 0.775, P = 0.225 for the absolute error and Q = 0.782, P = 0.218 for the relative error. I will use the former. The resulting graph is: average.gif. Where did the red line go? It's almost entirely covered by the green line, which instantly shows how good this approximation really is. The maximum error is about 0.001, a 50x improvement! The formula looks lengthy but the part between parentheses is the same value from the parabola, which needs to be computed only once. In fact only 2 extra multiplications and 2 additions are required to achieve this precision boost.
It shouldn't come as a big surprise that to make it work for negative x as well we need a second abs() operation. Also note that with Q = 1 - P, the weighted average Q y + P y^2 can be rewritten as y + P (y^2 - y), and replacing y^2 with y abs(y) handles the negative half. The final C code for the sine becomes:
#include <math.h>  // fabsf()

float sine(float x)
{
    const float pi = 3.14159265f;
    const float B = 4.0f / pi;
    const float C = -4.0f / (pi * pi);

    float y = B * x + C * x * fabsf(x);

#ifdef EXTRA_PRECISION
    const float P = 0.225f;
    y = P * (y * fabsf(y) - y) + y;
#endif

    return y;
}
So we need merely 5 multiplications and 3 additions; still faster than the 4-term Taylor if we neglect the abs(), and much more precise! The cosine version just needs the extra shift and wrap operations on x.
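Putting the shift and wrap together with the sine() above, a plain C cosine could look like this (my sketch, not code from the article):

float cosine(float x)
{
    const float pi = 3.14159265f;

    // cos(x) = sin(x + pi/2), wrapped back into [-pi, pi]
    x += pi / 2.0f;
    x -= (float)(x > pi) * (2.0f * pi);

    return sine(x);
}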
Last but not least, I wouldn't be Nick if I didn't include the SIMD-optimized assembly version. It allows the wrap operation to be performed very efficiently, so I'll give you the cosine:
// cos(x) = sin(x + pi/2)
addps xmm0, PI_2
movaps xmm1, xmm0
cmpnltps xmm1, PI
andps xmm1, PIx2
subps xmm0, xmm1
// Parabola
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
mulps xmm0, B
mulps xmm1, C
addps xmm0, xmm1
// Extra precision
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
subps xmm1, xmm0
mulps xmm1, P
addps xmm0, xmm1
This code computes four cosines in parallel, resulting in a peak performance of about 9 clock cycles per cosine on most CPU architectures. Sines take only 6 clock cycles ideally. The lower-precision version can even run at 3 clock cycles per sine... And let's not forget that all input between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0, pi/2 and pi.
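For readers more comfortable with intrinsics than raw assembly, the same kernel can be sketched with SSE intrinsics (the function and constant names here are mine, not from the article):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Four cosines at once: cos(x) = sin(x + pi/2), parabola + precision step. */
static inline __m128 fast_cos4(__m128 x)
{
    const __m128 pi     = _mm_set1_ps(3.14159265f);
    const __m128 pi_2   = _mm_set1_ps(1.57079633f);
    const __m128 two_pi = _mm_set1_ps(6.28318531f);
    const __m128 B      = _mm_set1_ps( 4.0f / 3.14159265f);
    const __m128 C      = _mm_set1_ps(-4.0f / (3.14159265f * 3.14159265f));
    const __m128 P      = _mm_set1_ps(0.225f);
    /* mask that clears the sign bit, i.e. a vectorized fabsf() */
    const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));

    /* shift and wrap: subtract 2*pi wherever x + pi/2 >= pi */
    x = _mm_add_ps(x, pi_2);
    x = _mm_sub_ps(x, _mm_and_ps(_mm_cmpnlt_ps(x, pi), two_pi));

    /* parabola: y = B*x + C*x*|x| */
    __m128 y = _mm_add_ps(_mm_mul_ps(B, x),
                          _mm_mul_ps(C, _mm_mul_ps(x, _mm_and_ps(x, absmask))));

    /* extra precision: y = P*(y*|y| - y) + y */
    y = _mm_add_ps(_mm_mul_ps(P, _mm_sub_ps(_mm_mul_ps(y, _mm_and_ps(y, absmask)), y)), y);
    return y;
}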
So, the conclusion is: don't ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.
Cheers,
Nick
==================================================
You might also find the comments that follow this article interesting; you can read them on the web archive page:
http://web.archive.org/web/20141220225551/http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648