One way to implement FMA in software is to split the significand into high and low parts. I use Dekker's algorithm:
typedef struct { float hi; float lo; } doublefloat;

doublefloat split(float a) {
    float t = ((1<<12)+1)*a;  /* multiply by the splitter 2^12 + 1 */
    float hi = t - (t - a);   /* hi keeps the upper ~12 significand bits */
    float lo = a - hi;        /* lo holds the rest, exactly */
    return (doublefloat){hi, lo};
}
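A quick sanity check (a minimal sketch of my own, assuming the doublefloat type and split above): the split is lossless, i.e. hi + lo reconstructs a exactly, and printing hex floats makes the bit split visible.

#include <stdio.h>

int main(void) {
    float a = 3.14159265f;    /* arbitrary test value */
    doublefloat d = split(a);
    printf("hi=%a lo=%a\n", d.hi, d.lo);
    printf("lossless: %d\n", d.hi + d.lo == a);  /* expected to print 1 */
    return 0;
}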
Once you have split the floats, you can calculate a*b - c with a single rounding like this:
float fmsub(float a, float b, float c) {
    doublefloat as = split(a), bs = split(b);
    /* the four partial products are exact; rounding only happens in the sums */
    return ((as.hi*bs.hi - c) + as.hi*bs.lo + as.lo*bs.hi) + as.lo*bs.lo;
}
This essentially subtracts c from (ahi + alo)*(bhi + blo) = ahi*bhi + ahi*blo + alo*bhi + alo*blo.
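This works because each of those four partial products is exact: after the split, each part fits in at most 12 significand bits, so each product fits in the 24-bit float significand and incurs no rounding. A minimal check of that claim (my sketch, assuming the definitions above and that float expressions are evaluated in single precision):

#include <stdio.h>

int main(void) {
    float a = 1.2345678f, b = 9.8765432f;
    doublefloat as = split(a), bs = split(b);
    /* each single-precision product should equal the exact double-precision product */
    printf("%d\n", (double)(as.hi*bs.hi) == (double)as.hi*(double)bs.hi);
    printf("%d\n", (double)(as.hi*bs.lo) == (double)as.hi*(double)bs.lo);
    printf("%d\n", (double)(as.lo*bs.hi) == (double)as.lo*(double)bs.hi);
    printf("%d\n", (double)(as.lo*bs.lo) == (double)as.lo*(double)bs.lo);
    return 0;
}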
I got this idea from the twoProd function in the paper Extended-Precision Floating-Point Numbers for GPU Computation and from the mul_sub_x function in Agner Fog's vector class library. He uses a different splitting function for vectors of floats, one that masks off the low bits of the significand instead. I tried to reproduce a scalar version of it here:
typedef union {float f; int i;} u;

doublefloat split2(float a) {
    u lo, hi = {a};
    hi.i &= -(1<<12);   /* clear the low 12 bits of the stored significand */
    lo.f = a - hi.f;    /* lo picks up exactly what was masked off */
    return (doublefloat){hi.f, lo.f};
}
In any case, using split or split2 in fmsub agrees well with fma(a,b,-c) from the math library in glibc. For whatever reason my version is significantly faster than fma, except on a machine that has hardware fma (in which case I use _mm_fmsub_ss anyway).
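For reference, a rough sketch of this kind of agreement test (not necessarily how I originally checked it; it assumes the functions above, uses fmaf from libm, and needs -lm):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    srand(1);
    int mismatches = 0;
    for (int i = 0; i < 1000000; i++) {
        float a = (float)rand()/RAND_MAX - 0.5f;
        float b = (float)rand()/RAND_MAX - 0.5f;
        float c = (float)rand()/RAND_MAX - 0.5f;
        if (fmsub(a, b, c) != fmaf(a, b, -c)) mismatches++;
    }
    printf("mismatches: %d out of 1000000\n", mismatches);
    return 0;
}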
You get hardware fma if you compile with -mfma or -mfma4 or -march=something, where something is an fma-supporting processor. On Linux, you might look at sysdeps/ieee754/dbl-64/s_fma.c in glibc to get an idea of what the library function fallback looks like. – tmyklebu