0 votes

The following complete program compares the speed of the fast inverse square root with 1/sqrt(). According to this sentence on Wikipedia: "The algorithm was approximately four times faster than computing the square root with another method and calculating the reciprocal via floating point division."

But here is why I am here: it is slower than 1/sqrt(). Is something wrong in my code?

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 1.0e+100;

      int N = 100000000;
      int i = 0;

      clock_t start2 = clock (); 
      do  
        {   
          float z = 1.0 / sqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end2 = clock (); 

      double time2 = (end2 - start2) / (double) CLOCKS_PER_SEC;

      printf ("1/sqrt() spends %13f sec.\n\n", time2);

      i = 0;
      clock_t start1 = clock (); 
      do  
        {   
          float y = FastInvSqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end1 = clock (); 

      double time1 = (end1 - start1) / (double) CLOCKS_PER_SEC;



      printf ("FastInvSqrt() spends %f sec.\n\n", time1);


      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", time2/time1);

      return 0;
    }

    float
    FastInvSqrt (float x)
    {
      float xhalf = 0.5F * x;
      int i = *(int *) &x;          // store floating-point bits in integer
      i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
      x = *(float *) &i;            // convert new bits into float
      x = x * (1.5 - xhalf * x * x);        // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      return x;
    }

The result is as follows:

1/sqrt() spends      0.850000 sec.

FastInvSqrt() spends 0.960000 sec.

fast inverse square root is faster 0.885417 times than 1/sqrt().
On which value(s) are you testing your function? There could be an initial constant penalty in just running the code with, e.g., an input of 1. So maybe try some very large values, and do your benchmarking on those. – Tim Biegeleisen

Your code is garbage because the loop bodies don't use the results of the square-root algorithms. Recode so that the loops calculate a different reciprocal square root every time, print out the sum of the results, and enable optimization. Also, your inverse square root takes at least 3 iterations to converge to single-precision accuracy, and should perhaps be computed 8 abreast so that the compiler can vectorize the operations with ymm registers. – user5713492

@San Tseng The reference for Wikipedia's claim that FastInvSqrt is four times faster than a particular alternative is a paper published in 2003, fifteen years ago. There is no particular reason to assume that what was true for the hardware platforms, toolchains, and libraries of 2003 still applies today, as all of those have changed substantially since then (e.g. the hardware instructions for square root on x86 processors have become much faster since then). – njuffa

@DillonDavis Timing tests with optimization disabled are futile. The tests must calculate N different reciprocal square roots and must use all of the results in the output, otherwise the optimizer can just skip the loops. To the OP: your loops as presented execute 100000000 iterations, not just 1. If it took about a second to compute just 1 reciprocal square root, well, that would be just pitiful, and it's not what is happening. – user5713492

@user5713492 The floating-point units as a whole have gotten wider, but that applies to square roots and multiplies alike. Last I checked, the throughput ratio between square roots and add/mul/fma had gotten significantly smaller than it was fifteen years ago, from about 1:25 down to 1:6 (or thereabouts). – njuffa

2 Answers

0 votes

I corrected my code as follows: 1. compute a random number instead of a fixed number; 2. measure the time consumption inside the while loop and sum it.

    #include <stdio.h>
    #include <time.h>
    #include <math.h>
    #include <stdlib.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 0;
      time_t t;

      srand ((unsigned) time (&t));

      int N = 1000000;
      int i = 0;

      double sum_time2 = 0.0;

      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start2 = clock ();
          float z = 1.0 / sqrt (x);
          clock_t end2 = clock ();
          sum_time2 = sum_time2 + (end2 - start2);
          i++;
        }
      while (i < N);

      printf ("1/sqrt() spends %13f sec.\n\n", sum_time2 / (double) CLOCKS_PER_SEC);

      double sum_time1 = 0.0;

      i = 0;
      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start1 = clock ();
          float y = FastInvSqrt (x);
          clock_t end1 = clock ();
          sum_time1 = sum_time1 + (end1 - start1);
          i++;
        }
      while (i < N);

      printf ("FastInvSqrt() spends %f sec.\n\n", sum_time1 / (double) CLOCKS_PER_SEC);

      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", sum_time2 / sum_time1);

      return 0;
    }

    float
    FastInvSqrt (float x)
    {
      float xhalf = 0.5F * x;
      int i = *(int *) &x;          // store floating-point bits in integer
      i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
      x = *(float *) &i;            // convert new bits into float
      x = x * (1.5 - xhalf * x * x);        // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      return x;
    }

But the fast inverse square root is still slower than 1/sqrt():

1/sqrt() spends      0.530000 sec.

FastInvSqrt() spends 0.540000 sec.

fast inverse square root is faster 0.981481 times than 1/sqrt().
0 votes

A function that reduces the domain in which it computes with precision will have less computational complexity (meaning that it can be computed faster). This can be thought of as optimizing the computation of a function's shape for a subset of its definition, or like search algorithms which each are best for a particular kind of input (No Free Lunch theorem).

As such, using this function for inputs outside the interval [0, 1] (which I suppose it was optimized / designed for) means using it in the subset of inputs where its complexity is worse (higher) than other possibly specialized variants of functions that compute square roots.

The sqrt() function you are using from the library was itself (likely) also optimized, as it has pre-computed values in a sort of LUT (which act as initial guesses for further approximations); using such a more "general function" (meaning that it covers more of the domain and tries to efficientize it by precomputation, for example; or eliminating redundant computation, but that is limited; or maximizing data reuse at run-time) has its complexity limitations, because the more choices between which precomputation to use for an interval, the more decision overhead there is; so knowing at compile-time that all your inputs to sqrt are in the interval [0, 1] would help reduce the run-time decision overhead, as you would know ahead of time which specialized approximation function to use (or you could generate specialized functions for each interval of interest, at compile-time -> see meta-programming for this).