The inverse hyperbolic sine asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is, I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double-precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error; the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would be at most minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine; on the other hand, I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may easily be transformed into a branchless, vectorizable version with a single call to log1pf() (one such transformation is sketched after the code below), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
                        = log1p (a + (sqrt (a*a+1) - 1))
                        = log1p (a + sqrt1pm1 (a*a))
                        = log1p (a + (a*a / (1 + sqrt (a*a + 1))))
                        = log1p (a + a * (a / (1 + sqrt (a*a + 1))))
                        = log1p (fma (a / (1 + sqrt (a*a + 1)), a, a))
                        = log1p (fma (1 / (1/a + sqrt (1/(a*a) + 1)), a, a))
*/
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
#if !USE_RECIPROCAL
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#else // USE_RECIPROCAL
    if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = 1.0f / fa;
        t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#endif // USE_RECIPROCAL
    return copysignf (t, a); // restore sign
}
With a particular log1pf() implementation that is accurate to < 0.6 ulp, I am observing the following error statistics when testing exhaustively across all 2^32 possible IEEE-754 single-precision inputs. When USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
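For anyone wanting to reproduce these statistics approximately, the exhaustive test is conceptually just a loop over all 2^32 bit patterns; a minimal harness might look as follows. Two caveats: it uses the host's double-precision asinh() as the reference, so double rounding makes the count of incorrectly rounded results only approximate, and the simple ulp spacing used is off by a factor of two for references just below a power of two:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

float my_asinhf (float a); /* implementation under test */

int main (void)
{
    uint32_t i = 0;
    unsigned long long wrong = 0;
    double max_err = 0.0;
    do {
        float x, r, res;
        double ref, ulp, err;
        memcpy (&x, &i, sizeof x);   /* reinterpret bit pattern as float */
        if (isfinite (x)) {          /* NaNs and infinities checked separately */
            ref = asinh ((double)x); /* reference result */
            res = my_asinhf (x);
            r = (float)ref;          /* reference rounded to single precision */
            if (res != r) wrong++;
            ulp = nextafterf (fabsf (r), INFINITY) - fabsf (r); /* float spacing at |ref| */
            err = fabs ((double)res - ref) / ulp;
            if (err > max_err) max_err = err;
        }
        i++;
    } while (i != 0);
    printf ("max. ulp error = %.5f  incorrectly rounded = %llu\n", max_err, wrong);
    return 0;
}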
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add), is correctly rounded according to IEEE-754 round-to-nearest-even mode. In addition, faster, nearly correctly rounded versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulp and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional performance cost.
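To make the last assumption concrete: on x86, a fast approximate rsqrtf() could for instance be synthesized from the hardware estimate (relative error at most 1.5 * 2^-12 per Intel's documentation) plus Newton-Raphson refinement. The sketch below (the name fast_rsqrtf is mine) performs a single refinement step and therefore only reaches an error of a few ulps, short of the 0.53 ulp figure above; it merely illustrates the kind of primitive assumed:

#include <math.h>       /* fmaf */
#include <xmmintrin.h>  /* _mm_rsqrt_ss, _mm_set_ss, _mm_cvtss_f32 */

static inline float fast_rsqrtf (float x)
{
    float y = _mm_cvtss_f32 (_mm_rsqrt_ss (_mm_set_ss (x))); /* ~12-bit estimate */
    /* one Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y) */
    y = y * fmaf (-0.5f * x * y, y, 1.5f);
    return y;
}

With such a primitive, sqrt(a*a+1) could for example be replaced by s * fast_rsqrtf(s) with s = fmaf(a, a, 1.0f); that is the kind of restructuring I am hoping an answer can exploit.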
Comments:

"... float and move to double asap, or even long double if your compiler supports 80-bit real number representation." - Weather Vane

"float operations can have significantly higher throughput than double operations, and on many platforms including x86 with AVX there is no convenient way to use any type wider than double." - njuffa

"... double support, but it may be up to 32x slower than float, so not practical given the need for high performance. In addition, I am using the float version as an experimental platform for the double version, and there is no hardware support for any precision higher than double." - njuffa

"... float case, but I am intending to re-use the same algorithm for double later, as I mentioned at the start of the question." - njuffa