How do hardware implementations of a floating-point square root work? Which algorithm would they use and can anyone provide links to verilog/vhdl implementations?
2 Answers
AFAIK, either a digit-recurrence algorithm (little resource) or Newton's iteration on the reciprocal square root (needs other operators: adder, multiplier, or FMA).
Concerning Newton's iteration, the choice of the initial approximation is not obvious. See Kornerup and Muller's article Choosing starting values for certain Newton–Raphson iterations.
You get the best bang for the money by implementing an approximation for 1 / sqrt (x) in hardware, giving maybe ten or twelve bits of precision, like Intel processors do. Then you use good old Newton iteration to improve that approximation using add/subtract/multiply only, and multiply the last approximation by x.
Alternatively, consider that calculating the square root of x is the same as dividing x by the square root of x. You can implement something very similar to a division, giving one bit of precision each time, except that the number you are dividing by changes in every iteration.