7
votes

I am working on an architecture which does not feature floating point hardware, but only a 16 bit ALU and a 40 bit MAC.

I have already implemented 32-bit single precision floating point addition/subtraction, multiplication, cosine, sine, division, square root, and range reduction all in software on this architecture.

To implement cosine and sine, I first performed range reduction using the method described in the paper "Argument Reduction for Huge Arguments" by K. C. Ng. I then implemented cosine and sine as polynomial approximations over the range -pi/4 to +pi/4, taking the polynomials from the book Computer Approximations by Hart et al.

I have also heard that I should consider the CORDIC algorithm, but I have not been able to tell whether it would be more or less efficient (in terms of throughput, memory overhead, and instruction count) than the method I already use. My target is a multicore architecture where each core has only 128 words of instruction memory and 128 words of 16-bit data memory. I have searched for how to implement the CORDIC algorithm for cosine and sine, but I couldn't find any good resources on 32-bit floating-point implementations. Does anybody have suggestions?

Thank you!

2
Do you have a performance problem with your existing implementation? — Pieter Geerkens
The performance is pretty slow. I am also going to compare my software implementations against having a hardware floating-point multiplier, adder, or both available. One question I have always had is: what about CORDIC, and how does it come into play here? Should I disregard it as an option, and why? I haven't found any good resources that answer this for 32-bit floating point. — Veridian
Did you use Horner to evaluate the polynomials? Which degree polynomials did you use? — bruce_ricard
My sine equation is of the form ax+bx^3+cx^5+dx^7. My cosine equation is of the form a+bx^2+cx^4+dx^6. I am not sure what you mean by Horner. I found the coefficients a, b, c, and d in the Computer Approximations book, using a polynomial that guarantees accuracy to 24 bits. — Veridian
For Horner's method, compute a + x2*(b + x2*(c + x2*d)) with x2 = x*x for cos, and similarly for sin. Note, Wikipedia says "when a hardware multiplier is available (e.g., in a DSP microprocessor), table-lookup methods and power series are generally faster than CORDIC". — James Waldby - jwpat7

2 Answers

6
votes

CORDIC gives you roughly one bit of accuracy per loop iteration, so implementing it in software will likely be slower than your polynomial version. That may also be why it is hard to find articles on software implementations of CORDIC: its performance is inferior, so nobody bothers.

Re your comment: Horner's method evaluates a polynomial from the highest-order coefficient down: start with the leading coefficient, then repeatedly multiply by the variable x and add the next coefficient. In contrast, the naive method (evaluating the powers of x first, then multiplying each by its coefficient and summing the results) takes more multiplications and can be less numerically stable than Horner's method.

You haven't mentioned exactly how you're trying to evaluate your polynomials, so I will suggest a formula:

x2 = x * x
cos = ((COS_D * x2 + COS_C) * x2 + COS_B) * x2 + COS_A
sin = (((SIN_D * x2 + SIN_C) * x2 + SIN_B) * x2 + SIN_A) * x
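As a concrete illustration, here is a sketch of the sine half in C. The coefficients below are plain Taylor-series terms used as placeholders; the minimax coefficients from Hart et al. will differ slightly and give a smaller maximum error over the same range:

```c
/* Horner evaluation of the odd sine polynomial on [-pi/4, +pi/4].
 * The Taylor coefficients here are placeholders, not the minimax
 * set from Hart et al.  Cost: five multiplies and three adds,
 * i.e. one multiply for x2, three multiply-adds, and a final
 * multiply by x. */
static float sin_poly(float x)
{
    const float SIN_A =  1.0f;            /*  x      */
    const float SIN_B = -1.0f / 6.0f;     /* -x^3/3! */
    const float SIN_C =  1.0f / 120.0f;   /*  x^5/5! */
    const float SIN_D = -1.0f / 5040.0f;  /* -x^7/7! */
    float x2 = x * x;
    return (((SIN_D * x2 + SIN_C) * x2 + SIN_B) * x2 + SIN_A) * x;
}
```

Note the whole chain is sequential: each multiply-add needs the previous result, which is exactly what maps onto a single MAC.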

Note that you can get better precision if you adapt your constants to the range over which you are evaluating the function, rather than using the Taylor coefficients. (Again, apologies if you have done some or all of these things, but you didn't mention what you had already tried...)


This is probably less relevant for your case (which presumably has just a single 16x16-bit MAC), but if your processor can issue multiple arithmetic operations at once, you may be able to get better performance by writing your evaluation in a tree-like form, which avoids some of the sequential dependency between operations:

x2 = x * x
x4 = x2 * x2
cos = (COS_D * x2 + COS_C) * x4 + (COS_B * x2 + COS_A)
sin = ((SIN_D * x2 + SIN_C) * x4 + (SIN_B * x2 + SIN_A)) * x

If your processor has a vector ALU, this formula also suggests a productive use for it...
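For what it's worth, this tree-shaped arrangement is essentially Estrin's scheme. A C sketch of the cosine half, again with Taylor coefficients standing in for the minimax set:

```c
/* Estrin-style evaluation of the even cosine polynomial on
 * [-pi/4, +pi/4].  The two inner multiply-adds are independent
 * of each other, so a machine that can issue two MACs at once
 * can overlap them.  Taylor coefficients are placeholders. */
static float cos_tree(float x)
{
    const float COS_A =  1.0f;           /*  x^0/0! */
    const float COS_B = -1.0f / 2.0f;    /* -x^2/2! */
    const float COS_C =  1.0f / 24.0f;   /*  x^4/4! */
    const float COS_D = -1.0f / 720.0f;  /* -x^6/6! */
    float x2 = x * x;
    float x4 = x2 * x2;
    return (COS_D * x2 + COS_C) * x4 + (COS_B * x2 + COS_A);
}
```

The extra multiply for x4 is the price of the shorter dependency chain, so on a strictly single-issue MAC this form gains nothing over Horner.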

3
votes

If the MAC is significantly faster than an equivalent sequence of shifts, ANDs, and adds, use a polynomial; don't even consider CORDIC (except possibly for a single step or two of range reduction). It's hard to find floating-point CORDIC algorithms precisely because that criterion has held on essentially any system where floating point is used (for the past ~35 years), so CORDIC isn't considered.