I am in a need of fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add
, mul
, sub
, shift
in one cycle (well - the operation's result is written in third cycle, really - but there's interleaving), so any Integer algorithm that uses these ops and is fast would be very appreciated.
This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:
unsigned short int int_sqrt32(unsigned int x)
{
unsigned short int res=0;
unsigned short int add= 0x8000;
int i;
for(i=0;i<16;i++)
{
unsigned short int temp=res | add;
unsigned int g2=temp*temp;
if (x>=g2)
{
res=temp;
}
add>>=1;
}
return res;
}
Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.
Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?
FAQ:
Why don't you rely on the compiler to produce a good fast code?
There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.
Did you profile the code to see if
sqrt
is actually a bottleneck?No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this
sqrt
is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.Have you tried to optimize the algorithm to perhaps remove the
sqrt
?Yes, I did that already. In fact, I got rid of 2
sqrt
s already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.What is the application?
It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. Up 60-times per second, though the SW rasterizer won't probably be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).
Why don't you work directly in target ASM?
Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printing debugging".
Why can't you use a division?
Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).
Can you use look-up tables?
The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for
sqrt
.What's the range of the values for
sqrt
?I managed to reduce it to mere unsigned 32-bit int (4,294,967,295)
clz
or something similar as a 1 cycle instruction.) – biziclop