0 votes

The following complete program compares the speed of the fast inverse square root with 1/sqrt(). According to this sentence on Wikipedia: "The algorithm was approximately four times faster than computing the square root with another method and calculating the reciprocal via floating point division."

But here is why I am here: it is slower than 1/sqrt(). Is something wrong in my code?

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 1.0e+100;

      int N = 100000000;
      int i = 0;

      clock_t start2 = clock (); 
      do  
        {   
          float z = 1.0 / sqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end2 = clock (); 

      double time2 = (end2 - start2) / (double) CLOCKS_PER_SEC;

      printf ("1/sqrt() spends %13f sec.\n\n", time2);

      i = 0;
      clock_t start1 = clock (); 
      do  
        {   
          float y = FastInvSqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end1 = clock (); 

      double time1 = (end1 - start1) / (double) CLOCKS_PER_SEC;



      printf ("FastInvSqrt() spends %f sec.\n\n", time1);


      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", time2/time1);

      return 0;
    }

    float
    FastInvSqrt (float x)
    {
      float xhalf = 0.5F * x;
      int i = *(int *) &x;          // store floating-point bits in integer
      i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
      x = *(float *) &i;            // convert new bits into float
      x = x * (1.5 - xhalf * x * x);        // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      return x;
    }

The result is as follows:

1/sqrt() spends      0.850000 sec.

FastInvSqrt() spends 0.960000 sec.

fast inverse square root is faster 0.885417 times than 1/sqrt().
On which value(s) are you testing your function? There could be an initial constant penalty in just running the code with, e.g., an input of 1. So maybe try some very large values, and do your benchmarking on those. – Tim Biegeleisen

Your code is garbage because the loop bodies don't use the results of the square-root algorithms. Recode so that the loops calculate a different reciprocal square root every time, print out the sum of the results, and enable optimization. Also, your inverse square root takes at least 3 iterations to converge to single-precision accuracy, and should perhaps be computed 8 abreast so that the compiler can vectorize the operations with ymm registers. – user5713492

@San Tseng The reference for Wikipedia's claim that FastInvSqrt is four times faster than a particular alternative is a paper published in 2003, fifteen years ago. There is no particular reason to assume that what was true for the hardware platforms, toolchains, and libraries of 2003 still applies today, as all of those have changed substantially since then (e.g. the hardware instructions for square root on x86 processors have become much faster since then). – njuffa

@DillonDavis Timing tests with optimization disabled are futile. The tests must calculate N different reciprocal square roots and must use all of the results in the output, otherwise the optimizer can just skip the loops. To the OP: your loops as presented execute 100000000 iterations, not just 1. If it took about a second to compute just 1 reciprocal square root, well, that would be just pitiful, and it's not what is happening. – user5713492

@user5713492 The floating-point units as a whole have gotten wider, but that applies to square roots and multiplies alike. Last I checked, the throughput ratio between square roots and add/mul/fma had gotten significantly smaller than it was fifteen years ago, from about 1:25 down to 1:6 (or thereabouts). – njuffa

2 Answers

0 votes

I corrected my code as follows: 1. compute a random number instead of a fixed number; 2. measure the time consumption inside the while loop and sum it.

    #include <stdio.h>
    #include <time.h>
    #include <math.h>
    #include <stdlib.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 0;
      time_t t;

      srand ((unsigned) time (&t));

      int N = 1000000;
      int i = 0;

      double sum_time2 = 0.0;

      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start2 = clock ();
          float z = 1.0 / sqrt (x);
          clock_t end2 = clock ();
          sum_time2 = sum_time2 + (end2 - start2);
          i++;
        }
      while (i < N);

      printf ("1/sqrt() spends %13f sec.\n\n", sum_time2 / (double) CLOCKS_PER_SEC);

      double sum_time1 = 0.0;

      i = 0;
      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start1 = clock ();
          float y = FastInvSqrt (x);
          clock_t end1 = clock ();
          sum_time1 = sum_time1 + (end1 - start1);
          i++;
        }
      while (i < N);

      printf ("FastInvSqrt() spends %f sec.\n\n", sum_time1 / (double) CLOCKS_PER_SEC);

      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", sum_time2 / sum_time1);

      return 0;
    }

    float
    FastInvSqrt (float x)
    {
      float xhalf = 0.5F * x;
      int i = *(int *) &x;          // store floating-point bits in integer
      i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
      x = *(float *) &i;            // convert new bits into float
      x = x * (1.5 - xhalf * x * x);        // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
      return x;
    }

But the fast inverse square root is still slower than 1/sqrt():

1/sqrt() spends      0.530000 sec.

FastInvSqrt() spends 0.540000 sec.

fast inverse square root is faster 0.981481 times than 1/sqrt().
0 votes

A function that reduces the domain in which it computes with precision will have less computational complexity (meaning that it can be computed faster). This can be thought of as optimizing the computation of a function's shape for a subset of its definition, or like search algorithms which each are best for a particular kind of input (No Free Lunch theorem).

As such, using this function for inputs outside the interval [0, 1] (which I suppose it was optimized / designed for) means using it in the subset of inputs where its complexity is worse (higher) than other possibly specialized variants of functions that compute square roots.

The sqrt() function you are using from the library was itself (likely) also optimized, as it has pre-computed values in a sort of LUT (which act as initial guesses for further approximations); using such a more "general function" (meaning that it covers more of the domain and tries to efficientize it by precomputation, for example; or eliminating redundant computation, but that is limited; or maximizing data reuse at run-time) has its complexity limitations, because the more choices between which precomputation to use for an interval, the more decision overhead there is; so knowing at compile-time that all your inputs to sqrt are in the interval [0, 1] would help reduce the run-time decision overhead, as you would know ahead of time which specialized approximation function to use (or you could generate specialized functions for each interval of interest, at compile-time -> see meta-programming for this).