The accuracy issue raised in the question can be addressed effectively, and efficiently, by limited use of double-float or double-double computation, facilitated by the fused multiply-add (FMA) operation. This operation has been available since C99 via the standard math functions fmaf(a,b,c) and fma(a,b,c), which compute a*b+c without rounding of the intermediate product. While these functions map directly to fast hardware instructions on almost all modern processors, they may fall back to emulation code on older platforms, in which case they can be very slow.
This allows the computation of the product with twice the normal precision using just two operations, resulting in a head:tail pair of native-precision numbers:
prod_hi = a * b                   // head
prod_lo = FMA (a, b, -prod_hi)    // tail
The high-order bits of the result can be passed to the exponentiation, while the low-order bits are used for improving the accuracy of the result via linear interpolation, taking advantage of the fact that e^x is its own derivative:
e = exp (prod_hi) + exp (prod_hi) * prod_lo // exp (a*b)
This allows us to eliminate most of the error of the naive implementation. The other, minor source of computational error is the limited precision with which the constant 1/√(2π) is represented. This can be improved by using a head:tail representation of the constant that provides twice the native precision, and computing:
r = FMA (const_hi, x, const_lo * x) // const * x
The following paper points out that this technique can even result in correctly-rounded multiplication for some arbitrary-precision constants:
Nicolas Brisebarre and Jean-Michel Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, Vol. 57, No. 2, February 2008, pp. 165-174
Combining the two techniques, and taking care of a few corner cases involving NaNs, we arrive at the following float implementation based on IEEE-754 binary32:
float my_normpdff (float a)
{
    const float RCP_SQRT_2PI_HI = 0x1.988454p-02f;  /* 1/sqrt(2*pi), msbs */
    const float RCP_SQRT_2PI_LO = -0x1.857936p-27f; /* 1/sqrt(2*pi), lsbs */
    float ah, sh, sl, ea;

    ah = -0.5f * a;
    sh = a * ah;
    sl = fmaf (a, ah, 0.5f * a * a); /* don't flip "sign bit" of NaN argument */
    ea = expf (sh);
    if (ea != 0.0f) ea = fmaf (sl, ea, ea); /* avoid creation of NaN */
    return fmaf (RCP_SQRT_2PI_HI, ea, RCP_SQRT_2PI_LO * ea);
}
The corresponding double implementation, based on IEEE-754 binary64, looks almost identical, except for the different constant values used:
double my_normpdf (double a)
{
    const double RCP_SQRT_2PI_HI = 0x1.9884533d436510p-02;  /* 1/sqrt(2*pi), msbs */
    const double RCP_SQRT_2PI_LO = -0x1.cbc0d30ebfd150p-56; /* 1/sqrt(2*pi), lsbs */
    double ah, sh, sl, ea;

    ah = -0.5 * a;
    sh = a * ah;
    sl = fma (a, ah, 0.5 * a * a); /* don't flip "sign bit" of NaN argument */
    ea = exp (sh);
    if (ea != 0.0) ea = fma (sl, ea, ea); /* avoid creation of NaN */
    return fma (RCP_SQRT_2PI_HI, ea, RCP_SQRT_2PI_LO * ea);
}
The accuracy of these implementations depends on the accuracy of the standard math functions expf() and exp(), respectively. Where the C math library provides faithfully-rounded versions of those, the maximum error of either of the two implementations above is typically less than 2.5 ulps.