I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. This requires computing 2^64 / divisor as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, because the subsequent fixed-point iterations require an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
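For concreteness, the underestimate requirement can be spot-checked mechanically. The sketch below assumes a hypothetical wrapper recip_approx() around the snippet above and uses GCC/Clang's unsigned __int128 to avoid overflow in the check:
#include <stdint.h>
/* hypothetical wrapper around the snippet above; the name is illustrative */
extern uint64_t recip_approx (uint64_t divisor);
/* recip <= floor(2**64 / divisor) holds iff recip * divisor <= 2**64 */
static int is_underestimate (uint64_t divisor, uint64_t recip)
{
    unsigned __int128 prod = (unsigned __int128)recip * divisor;
    return prod <= ((unsigned __int128)1 << 64);
}
Driving this check with random and boundary divisors gives a quick sanity test of the underestimate property.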
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <xmmintrin.h>
#include <math.h> // import fmaf()
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);        // move a into the low lane of an SSE register
    t = _mm_rcp_ss (t);        // hardware reciprocal estimate (~12 bits)
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);    // e = 1 - a*r
    e = fmaf (e, e, e);        // e = e + e*e
    r = fmaf (e, r, r);        // r = r + r*e  -> cubic convergence
    return r;
}
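A corresponding sketch for AArch64 could use the NEON reciprocal estimate (FRECPE) via vdup_n_f32 / vrecpe_f32 / vget_lane_f32 from arm_neon.h. Since FRECPE only provides about 8 bits of accuracy, versus roughly 12 bits for RCPSS, the sketch below adds one Newton step before the Halley step:
#include <arm_neon.h>
#include <math.h> // import fmaf()
/* Compute 1.0f/a almost correctly rounded on AArch64 (sketch, untested) */
inline float fast_recip_f32_neon (float a)
{
    float32x2_t v = vdup_n_f32 (a);              // broadcast a into a 2-lane vector
    float r = vget_lane_f32 (vrecpe_f32 (v), 0); // FRECPE estimate, ~8 bits
    float e;
    e = fmaf (r, -a, 1.0f);  // Newton step: e = 1 - a*r
    r = fmaf (e, r, r);      // r = r + r*e, ~16 good bits
    e = fmaf (r, -a, 1.0f);  // Halley step as in the x86_64 version
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}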
Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly reinterpret an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
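Here the constant 0x1fffffff folds the one-ulp decrement performed by nextafterf (r, 0.0f) (subtracting 1 from the bit pattern of the positive, normal r) together with the multiplication by 0x1.0p64f (adding 64 to the exponent field, i.e. 0x20000000 to the bit pattern). Where such intrinsics are not provided, the reinterpretation can be expressed portably with memcpy-based type punning, which mainstream compilers reduce to a plain register move; a minimal sketch of the two helpers assumed above:
#include <stdint.h>
#include <string.h> // import memcpy()
static inline int32_t float_as_int (float a)
{
    int32_t r;
    memcpy (&r, &a, sizeof r); // reinterpret the binary32 bit pattern
    return r;
}
static inline float int_as_float (int32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r); // reinterpret the int32 bit pattern
    return r;
}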
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types at all, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru() uses dynamic changes to the FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float, which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes the computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and following those up with a 32-bit fixed-point Newton-Raphson step.
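For reference, one such integer-side rounding scheme might look like the sketch below (assuming GCC/Clang's __builtin_clzll): the operand is reduced to at most 24 significant bits with round-up, so the subsequent round-to-nearest conversion and the final scaling are exact. The count-leading-zeros, shift, mask, increment and extra multiply illustrate the overhead mentioned above:
#include <stdint.h>
/* Convert uint64_t to float, rounding up, with the rounding done in integer
   arithmetic (sketch; assumes __builtin_clzll is available) */
static inline float uint64_to_float_ru_bits (uint64_t a)
{
    if (a == 0ULL) return 0.0f;
    int lz = __builtin_clzll (a);           // a has 64-lz significant bits
    int shift = (lz < 40) ? (40 - lz) : 0;  // bits that do not fit a 24-bit significand
    if (shift > 0) {
        uint64_t rest = a & ((1ULL << shift) - 1);
        a = (a >> shift) + (rest != 0ULL);  // round up on the integer side
    }
    return (float)a * (float)(1ULL << shift); // both factors exact, product exact
}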
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-), the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip);
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<127:64>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
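On compilers that provide a 128-bit integer type (GCC, Clang, recent ICC), a portable sketch that typically compiles to the same single widening multiply is:
#include <stdint.h>
static inline uint64_t umul64hi_portable (uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64); // high 64 bits of the 128-bit product
}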
Compiled with /O3, /QxHOST. One look at the generated assembly code was enough to convince me that this initial approximation lacks efficiency (the NR iterations are fine). Way too many instructions, many related to splitting the uint64_t for the conversions, it seems. On an NVIDIA GPU, using intrinsics, this approach can map to about five instructions or so and is usable. – njuffa