I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. This requires computing 2^64 / divisor as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, because the subsequent fixed-point iterations require an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
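For concreteness, the underestimate requirement can be spot-checked mechanically. The sketch below assumes a hypothetical wrapper recip_approx() around the snippet above and uses GCC/Clang's unsigned __int128 to avoid overflow in the check:
#include <stdint.h>
/* hypothetical wrapper around the snippet above; the name is illustrative */
extern uint64_t recip_approx (uint64_t divisor);
/* recip <= floor(2**64 / divisor) holds iff recip * divisor <= 2**64 */
static int is_underestimate (uint64_t divisor, uint64_t recip)
{
    unsigned __int128 prod = (unsigned __int128)recip * divisor;
    return prod <= ((unsigned __int128)1 << 64);
}
Driving this check with random and boundary divisors gives a quick sanity test of the underestimate property.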
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <xmmintrin.h>
#include <math.h> // import fmaf()
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);        // move a into the low lane of an SSE register
    t = _mm_rcp_ss (t);        // hardware reciprocal estimate (~12 bits)
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);    // e = 1 - a*r
    e = fmaf (e, e, e);        // e = e + e*e
    r = fmaf (e, r, r);        // r = r + r*e  -> cubic convergence
    return r;
}
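A corresponding sketch for AArch64 could use the NEON reciprocal estimate (FRECPE) via vdup_n_f32 / vrecpe_f32 / vget_lane_f32 from arm_neon.h. Since FRECPE only provides about 8 bits of accuracy, versus roughly 12 bits for RCPSS, the sketch below adds one Newton step before the Halley step:
#include <arm_neon.h>
#include <math.h> // import fmaf()
/* Compute 1.0f/a almost correctly rounded on AArch64 (sketch, untested) */
inline float fast_recip_f32_neon (float a)
{
    float32x2_t v = vdup_n_f32 (a);              // broadcast a into a 2-lane vector
    float r = vget_lane_f32 (vrecpe_f32 (v), 0); // FRECPE estimate, ~8 bits
    float e;
    e = fmaf (r, -a, 1.0f);  // Newton step: e = 1 - a*r
    r = fmaf (e, r, r);      // r = r + r*e, ~16 good bits
    e = fmaf (r, -a, 1.0f);  // Halley step as in the x86_64 version
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}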
Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly reinterpret an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
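Here the constant 0x1fffffff folds the one-ulp decrement performed by nextafterf (r, 0.0f) (subtracting 1 from the bit pattern of the positive, normal r) together with the multiplication by 0x1.0p64f (adding 64 to the exponent field, i.e. 0x20000000 to the bit pattern). Where such intrinsics are not provided, the reinterpretation can be expressed portably with memcpy-based type punning, which mainstream compilers reduce to a plain register move; a minimal sketch of the two helpers assumed above:
#include <stdint.h>
#include <string.h> // import memcpy()
static inline int32_t float_as_int (float a)
{
    int32_t r;
    memcpy (&r, &a, sizeof r); // reinterpret the binary32 bit pattern
    return r;
}
static inline float int_as_float (int32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r); // reinterpret the int32 bit pattern
    return r;
}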
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types at all, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru() uses dynamic changes to the FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float, which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes the computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and following those up with a 32-bit fixed-point Newton-Raphson step.
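For reference, one such integer-side rounding scheme might look like the sketch below (assuming GCC/Clang's __builtin_clzll): the operand is reduced to at most 24 significant bits with round-up, so the subsequent round-to-nearest conversion and the final scaling are exact. The count-leading-zeros, shift, mask, increment and extra multiply illustrate the overhead mentioned above:
#include <stdint.h>
/* Convert uint64_t to float, rounding up, with the rounding done in integer
   arithmetic (sketch; assumes __builtin_clzll is available) */
static inline float uint64_to_float_ru_bits (uint64_t a)
{
    if (a == 0ULL) return 0.0f;
    int lz = __builtin_clzll (a);           // a has 64-lz significant bits
    int shift = (lz < 40) ? (40 - lz) : 0;  // bits that do not fit a 24-bit significand
    if (shift > 0) {
        uint64_t rest = a & ((1ULL << shift) - 1);
        a = (a >> shift) + (rest != 0ULL);  // round up on the integer side
    }
    return (float)a * (float)(1ULL << shift); // both factors exact, product exact
}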
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-), the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip);
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<127:64>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
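On compilers that provide a 128-bit integer type (GCC, Clang, recent ICC), a portable sketch that typically compiles to the same single widening multiply is:
#include <stdint.h>
static inline uint64_t umul64hi_portable (uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64); // high 64 bits of the 128-bit product
}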
Compiled with /O3, /QxHOST. One look at the generated assembly code was enough to convince me that this initial approximation lacks efficiency (the NR iterations are fine). Way too many instructions, many related to splitting the uint64_t for the conversions, it seems. On an NVIDIA GPU, using intrinsics, this approach can map to about five instructions or so and is usable. – njuffa