I am working on an architecture that has no floating-point hardware, only a 16-bit ALU and a 40-bit MAC.
I have already implemented 32-bit single-precision floating-point addition/subtraction, multiplication, cosine, sine, division, square root, and range reduction, all in software on this architecture.
To implement cosine and sine, I first perform range reduction using the method described in the paper "Argument Reduction for Huge Arguments" by K. C. Ng. I then evaluate polynomial approximations to cosine and sine over the range -pi/4 to +pi/4, with the polynomials taken from the book "Computer Approximations" by Hart et al.
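For context, the kernel evaluation I ended up with has roughly the shape sketched below (written in C here for readability; on the target it goes through my software floating-point routines). The coefficients shown are just truncated Taylor-series values standing in for the actual minimax coefficients from Hart et al., which I'm not reproducing here; the Horner-scheme structure is the point.

```c
/* Sketch of the polynomial kernels over [-pi/4, +pi/4].
 * NOTE: these coefficients are truncated Taylor-series values used purely
 * as placeholders; the real implementation uses minimax coefficients from
 * Hart et al., "Computer Approximations". On the target, every multiply
 * and add here is a call into my software float routines. */
static float kernel_sin(float x)
{
    float x2 = x * x;
    /* sin(x) ~ x + S1*x^3 + S2*x^5 + S3*x^7, evaluated with Horner's scheme */
    const float S1 = -1.0f / 6.0f;
    const float S2 =  1.0f / 120.0f;
    const float S3 = -1.0f / 5040.0f;
    return x + x * x2 * (S1 + x2 * (S2 + x2 * S3));
}

static float kernel_cos(float x)
{
    float x2 = x * x;
    /* cos(x) ~ 1 + C1*x^2 + C2*x^4 + C3*x^6 */
    const float C1 = -1.0f / 2.0f;
    const float C2 =  1.0f / 24.0f;
    const float C3 = -1.0f / 720.0f;
    return 1.0f + x2 * (C1 + x2 * (C2 + x2 * C3));
}
```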
I have also heard that I should consider the CORDIC algorithm. Does anyone know whether it would be more or less efficient (in terms of throughput, memory overhead, and number of instructions required) than the method I am already using? For context, my software functions run on a multicore architecture where each core has only 128 words of instruction memory and a 128-word, 16-bit data memory. I have also tried searching for how to implement CORDIC for cosine and sine, but I couldn't find any good resources on 32-bit floating-point implementations. Does anybody have suggestions?
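From what I can tell, CORDIC is normally formulated in fixed point, so I assume a 32-bit float version would convert the reduced argument to fixed point, run the iterations, and convert back. Below is a minimal rotation-mode sketch of my current understanding; the Q2.29 format, 30 iterations, and the on-the-fly table generation are just assumptions for illustration (on the real target the arctangent table would presumably be a precomputed constant table sitting in data memory).

```c
#include <stdint.h>
#include <math.h>

#define CORDIC_ITERS 30
#define Q 29  /* Q2.29 fixed point: value = raw / 2^29 (assumption for this sketch) */

/* Rotation-mode CORDIC: given an angle in [-pi/4, +pi/4] in Q2.29,
 * produce cos and sin simultaneously using only shifts, adds, and a
 * small arctangent table. Assumes arithmetic right shift of negatives. */
static void cordic_sincos(int32_t angle, int32_t *cos_out, int32_t *sin_out)
{
    /* atan(2^-i) table in Q2.29; built here for the sketch, but it would
     * be a precomputed ROM/data-memory table on the target. */
    int32_t atan_tab[CORDIC_ITERS];
    for (int i = 0; i < CORDIC_ITERS; i++)
        atan_tab[i] = (int32_t)lround(atan(ldexp(1.0, -i)) * (1 << Q));

    /* Start at (K, 0) with K = prod(1/sqrt(1 + 2^-2i)) ~ 0.607253,
     * so the CORDIC gain is pre-compensated. */
    int32_t x = (int32_t)lround(0.6072529350088812 * (1 << Q));
    int32_t y = 0;
    int32_t z = angle;

    for (int i = 0; i < CORDIC_ITERS; i++) {
        int32_t xs = x >> i, ys = y >> i;
        if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
        else        { x += ys; y -= xs; z += atan_tab[i]; }
    }
    *cos_out = x;  /* ~ cos(angle) in Q2.29 */
    *sin_out = y;  /* ~ sin(angle) in Q2.29 */
}
```

If I understand it correctly, the inner loop needs only shifts, adds, and one table lookup per iteration, at roughly one iteration per bit of precision, which is why I suspect its memory and instruction-count trade-offs differ quite a bit from the polynomial approach.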
Thank you!