The inverse hyperbolic sine asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is, I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double-precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error; the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would be at most minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine; on the other hand, I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may easily be transformed into a branchless, vectorizable version with a single call to log1pf() (one such transformation is sketched after the code below), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
                        = log1p (a + (sqrt (a*a+1) - 1))
                        = log1p (a + sqrt1pm1 (a*a))
                        = log1p (a + (a*a / (1 + sqrt (a*a + 1))))
                        = log1p (a + a * (a / (1 + sqrt (a*a + 1))))
                        = log1p (fma (a / (1 + sqrt (a*a + 1)), a, a))
                        = log1p (fma (1 / (1/a + sqrt (1/(a*a) + 1)), a, a))
*/
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
#if !USE_RECIPROCAL
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#else // USE_RECIPROCAL
    if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = 1.0f / fa;
        t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#endif // USE_RECIPROCAL
    return copysignf (t, a); // restore sign
}
With a particular log1pf() implementation that is accurate to < 0.6 ulp, I am observing the following error statistics when testing exhaustively across all 2^32 possible IEEE-754 single-precision inputs. When USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
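For anyone wanting to reproduce these statistics approximately, the exhaustive test is conceptually just a loop over all 2^32 bit patterns; a minimal harness might look as follows. Two caveats: it uses the host's double-precision asinh() as the reference, so double rounding makes the count of incorrectly rounded results only approximate, and the simple ulp spacing used is off by a factor of two for references just below a power of two:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

float my_asinhf (float a); /* implementation under test */

int main (void)
{
    uint32_t i = 0;
    unsigned long long wrong = 0;
    double max_err = 0.0;
    do {
        float x, r, res;
        double ref, ulp, err;
        memcpy (&x, &i, sizeof x);   /* reinterpret bit pattern as float */
        if (isfinite (x)) {          /* NaNs and infinities checked separately */
            ref = asinh ((double)x); /* reference result */
            res = my_asinhf (x);
            r = (float)ref;          /* reference rounded to single precision */
            if (res != r) wrong++;
            ulp = nextafterf (fabsf (r), INFINITY) - fabsf (r); /* float spacing at |ref| */
            err = fabs ((double)res - ref) / ulp;
            if (err > max_err) max_err = err;
        }
        i++;
    } while (i != 0);
    printf ("max. ulp error = %.5f  incorrectly rounded = %llu\n", max_err, wrong);
    return 0;
}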
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add), is correctly rounded according to IEEE-754 round-to-nearest-even mode. In addition, faster, nearly correctly rounded versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulp and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional performance cost.
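To make the last assumption concrete: on x86, a fast approximate rsqrtf() could for instance be synthesized from the hardware estimate (relative error at most 1.5 * 2^-12 per Intel's documentation) plus Newton-Raphson refinement. The sketch below (the name fast_rsqrtf is mine) performs a single refinement step and therefore only reaches an error of a few ulps, short of the 0.53 ulp figure above; it merely illustrates the kind of primitive assumed:

#include <math.h>       /* fmaf */
#include <xmmintrin.h>  /* _mm_rsqrt_ss, _mm_set_ss, _mm_cvtss_f32 */

static inline float fast_rsqrtf (float x)
{
    float y = _mm_cvtss_f32 (_mm_rsqrt_ss (_mm_set_ss (x))); /* ~12-bit estimate */
    /* one Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y) */
    y = y * fmaf (-0.5f * x * y, y, 1.5f);
    return y;
}

With such a primitive, sqrt(a*a+1) could for example be replaced by s * fast_rsqrtf(s) with s = fmaf(a, a, 1.0f); that is the kind of restructuring I am hoping an answer can exploit.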
Comments:

"... float and move to double asap, or even long double if your compiler supports 80-bit real number representation." - Weather Vane

"float operations can have significantly higher throughput than double operations, and on many platforms including x86 with AVX there is no convenient way to use any type wider than double." - njuffa

"... double support, but it may be up to 32x slower than float, so not practical given the need for high performance. In addition, I am using the float version as an experimental platform for the double version, and there is no hardware support for any precision higher than double." - njuffa

"... float case, but I am intending to re-use the same algorithm for double later, as I mentioned at the start of the question." - njuffa