5
votes

I am currently searching for a very fast integer square root approximation, where floor(sqrt(x)) <= veryFastIntegerSquareRoot(x) <= x

The square root routine is used for calculating prime numbers, which get considerably faster if only values below or equals sqrt(x) are checked for being a divisor of x.

What I am currently having is this function from Wikipedia, adjusted a small bit to work with 64-bit integers.

Because I have no other function to compare against (or more precise, the function is too precise for my purposes, and it probably takes more time, than being higher than the actual result.)

4
Did you profile your code? If you're doing trial division, the sqrt operation is quite unlikely to be the bottleneck.user2357112 supports Monica
The jump-free Newton-Raphson converges in just four steps, IIRC. (for 32 bit ints)wildplasser
@AntoineMathys: sqrt only has to be computed once, though. If you do i * i <= x, that's an extra multiplication every iteration.user2357112 supports Monica
the complexity still looks like O(n log n) or O(log n). Finding the prime numbers up to 10000000 takes roughly 40 seconds on a pi 2, while finding them up to 1000000 takes 2.Morten
@AntoineMathys I just tested it out. I am assigning sqrt(x) to a variable (of course) and the loop for up to 10000000 (when the innermost loop is empty) takes 4 seconds as opposed to the constant re-evaluation of i*i, which takes over two minutes.Morten

4 Answers

3
votes

Loopfree/jumpfree (well: almost ;-) Newton-Raphson:

/* static will allow inlining */
static unsigned usqrt4(unsigned val) {
    unsigned a, b;

    if (val < 2) return val; /* avoid div/0 */

    a = 1255;       /* starting point is relatively unimportant */

    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;

    return a;
}

For 64-bit ints you will need a few more steps (my guess: 6)

1
votes

On modern PC hardware, computing the square root of n may well be faster using floating point arithmetics than any kind of fast integer math, but as for all performance issues, careful benchmarking is required.

Note however that computing the square root may not be required at all: you can instead square the loop index and stop when the square exceeds the value of n. The dominant operation is the division in the loop body anyway:

#define PBITS32  ((1<<2) | (1<<3) | (1<<5) | (1<<7) | (1<<11) | (1<<13) | \
                  (1UL<<17) | (1UL<<19) | (1UL<<23) | (1UL<<29) | (1UL<<31))

int isprime(unsigned int n) {
    if (n < 32)
        return (PBITS32 >> n) & 1;
    if ((n & 1) == 0)
        return 0;
    for (unsigned int p = 3; p * p <= n; p += 2) {
        if (n % p == 0)
            return 0;
    }
    return 1;
}
1
votes

Computing floor(sqrt(x)) Exactly

Here is my solution, which is based on the bit-guessing approach proposed on Wikipedia. Unfortunately the pseudo-code provided on Wikipedia has some errors so I had to make some adjustments:

unsigned char bit_width(unsigned long long x) {
    return x == 0 ? 1 : 64 - __builtin_clzll(x);
}

// implementation for all unsigned integer types
unsigned sqrt(const unsigned n) {
    unsigned char shift = bit_width(n);
    shift += shift & 1; // round up to next multiple of 2

    unsigned result = 0;

    do {
        shift -= 2;
        result <<= 1; // leftshift the result to make the next guess
        result |= 1;  // guess that the next bit is 1
        result ^= result * result > (n >> shift); // revert if guess too high
    } while (shift != 0);

    return result;
}

bit_width can be evaluated in constant time and the loop will iterate at most ceil(bit_width / 2) times. So even for a 64-bit integer, this will be at worst 32 iterations of basic arithmetic and bitwise operations.

Unlike all other answers proposed so far, this actually gives you the best possible approximation, which is floor(sqrt(x)). For any x2, this will return x exactly.

Making a Guess using log2(x)

If this is still too slow for you, you can make a guess solely based on the binary logarithm. The basic idea is that we can compute the sqrt of any number 2x using 2x/2. x/2 might have a remainder, so we can't always compute this exactly, but we can compute an upper and lower bound.

For example:

  1. we are given 25
  2. compute floor(log_2(25)) = 4
  3. compute ceil(log_2(25)) = 5
  4. lower bound: pow(2, floor(4 / 2)) = 4
  5. upper bound: pow(2, ceil(5 / 2)) = 8

And indeed, the actual sqrt(25) = 5. We found sqrt(16) >= 4 and sqrt(32) <= 8. This means:

4 <= sqrt(16) <= sqrt(25) <= sqrt(32) <= 8
            4 <= sqrt(25) <= 8

This is how we can implement these guesses, which we will call sqrt_lo and sqrt_hi.

// this function computes a lower bound
unsigned sqrt_lo(const unsigned n) noexcept
{
    unsigned log2floor = bit_width(n) - 1;
    return (unsigned) (n != 0) << (log2floor >> 1);
}

// this function computes an upper bound
unsigned sqrt_hi(const unsigned n)
{
    bool isnt_pow2 = ((n - 1) & n) != 0; // check if n is a power of 2
    unsigned log2ceil = bit_width(n) - 1 + isnt_pow2;
    log2ceil += log2ceil & 1; // round up to multiple of 2
    return (unsigned) (n != 0) << (log2ceil >> 1);
}

For these two functions, the following statement is always true:

sqrt_lo(x) <= floor(sqrt(x)) <= sqrt(x) <= sqrt_hi(x) <= x

Note that if we assume that the input is never zero, then (unsigned) (n != 0) can be simplified to 1 and the statement is still true.

These functions can be evaluated in O(1) with a hardware-__builtin_clzll instruction. They only give precise results for numbers 22x, so 256, 64, 16, etc.

0
votes

This version can be faster as DIV is slow and for small numbers (Val<20k) this version reduces the error to less than 5%. Tested on ARM M0 (With no DIV hardware acceleration)

static uint32_t usqrt4_6(uint32_t val) {
    uint32_t a, b;

    if (val < 2) return val; /* avoid div/0 */
    a = 1255;       /* starting point is relatively unimportant */
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    if (val < 20000) {  
        b = val / a; a = (a + b)>>1;    // < 17% error Max
        b = val / a; a = (a + b)>>1;    // < 5%  error Max
    }
    return a;
}