
I am trying to compare CPU and GPU performance. I have

  • CPU : Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4
  • GPU : NVidia GeForce GT 420M

I can confirm that the GPU is configured and works correctly with CUDA.

I am implementing a Julia set computation (http://en.wikipedia.org/wiki/Julia_set). Basically, for every pixel, if the coordinate is in the set it is painted red, otherwise it is painted white.

Although I get identical output from both the CPU and the GPU, instead of a performance improvement I get a performance penalty when using the GPU.

Running times

  • CPU : 0.052s
  • GPU : 0.784s

I am aware that transferring data from device to host can take some time. But still, how do I know whether using the GPU is actually beneficial? (A sketch of how I am planning to time the kernel loop separately from the copy follows the GPU code below.)

Here is the relevant GPU code

    #include <stdio.h>
    #include <cuda.h>

    __device__ bool isJulia( float x, float y, float maxX_2, float maxY_2 )
    {
        float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
        float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;

        float c_r = -0.8;
        float c_i = 0.156;
        for( int i=1 ; i<100 ; i++ )
        {
            float tmp_r = z_r*z_r - z_i*z_i + c_r;
            float tmp_i = 2*z_r*z_i + c_i;

            z_r = tmp_r;
            z_i = tmp_i;

            if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
                return false;
        }
        return true;
    }

    __global__ void kernel( unsigned char * im, int dimx, int dimy )
    {
        //int tid = blockIdx.y*gridDim.x + blockIdx.x;

        //one thread per pixel: blockIdx.x is the pixel's x, threadIdx.x its y
        int tid = blockIdx.x*blockDim.x + threadIdx.x;
        tid *= 3;   //3 bytes (RGB) per pixel
        if( isJulia((float)blockIdx.x, (float)threadIdx.x, (float)dimx/2, (float)dimy/2)==true )
        {
            im[tid]   = 255;    //red
            im[tid+1] = 0;
            im[tid+2] = 0;
        }
        else
        {
            im[tid]   = 255;    //white
            im[tid+1] = 255;
            im[tid+2] = 255;
        }

    }

    int main()
    {
        int dimx=768, dimy=768;

        //on cpu
        unsigned char * im = (unsigned char*) malloc( 3*dimx*dimy );

        //on GPU
        unsigned char * im_dev;

        //allocate mem on GPU
        cudaMalloc( (void**)&im_dev, 3*dimx*dimy ); 

        //launch kernel
        for( int z=0 ; z<10000 ; z++ )  //loop to repeat the computation many times
        {
            kernel<<<dimx,dimy>>>(im_dev, dimx, dimy);
        }

        cudaMemcpy( im, im_dev, 3*dimx*dimy, cudaMemcpyDeviceToHost );

        writePPMImage( im, dimx, dimy, 3, "out_gpu.ppm" ); //assume this writes a ppm file

        free( im );
        cudaFree( im_dev );
    }
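
To separate the copy from the computation, I am also thinking of timing just the kernel loop with CUDA events, roughly like this (a sketch I have not fully verified; kernel, im, im_dev, dimx and dimy are the ones from the code above):

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    //time only the kernel launches
    cudaEventRecord( start, 0 );
    for( int z=0 ; z<10000 ; z++ )
    {
        kernel<<<dimx,dimy>>>(im_dev, dimx, dimy);
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );   //launches are asynchronous, so wait here

    float kernel_ms = 0.0f;
    cudaEventElapsedTime( &kernel_ms, start, stop );

    //time only the device-to-host copy
    cudaEventRecord( start, 0 );
    cudaMemcpy( im, im_dev, 3*dimx*dimy, cudaMemcpyDeviceToHost );
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );

    float copy_ms = 0.0f;
    cudaEventElapsedTime( &copy_ms, start, stop );

    printf( "kernels: %f ms, copy: %f ms\n", kernel_ms, copy_ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );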

Here is the CPU code

    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>   //for sqrt

    bool isJulia( float x, float y, float maxX_2, float maxY_2 )
    {
        float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
        float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;

        float c_r = -0.8;
        float c_i = 0.156;
        for( int i=1 ; i<100 ; i++ )
        {
            float tmp_r = z_r*z_r - z_i*z_i + c_r;
            float tmp_i = 2*z_r*z_i + c_i;

            z_r = tmp_r;
            z_i = tmp_i;

            if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
                return false;
        }
        return true;
    }

    int main(void)
    {
      const int dimx = 768, dimy = 768;
      int i, j;

      unsigned char * data = new unsigned char[dimx*dimy*3];

      for( int z=0 ; z<10000 ; z++ )  //loop to repeat the computation many times
      {
      for (j = 0; j < dimy; ++j)
      {
        for (i = 0; i < dimx; ++i)
        {
          if( isJulia(i,j,dimx/2,dimy/2) == true )
          {
          data[3*j*dimx + 3*i + 0] = (unsigned char)255;  /* red */
          data[3*j*dimx + 3*i + 1] = (unsigned char)0;  /* green */
          data[3*j*dimx + 3*i + 2] = (unsigned char)0;  /* blue */
          }
          else
          {
          data[3*j*dimx + 3*i + 0] = (unsigned char)255;  /* red */
          data[3*j*dimx + 3*i + 1] = (unsigned char)255;  /* green */
          data[3*j*dimx + 3*i + 2] = (unsigned char)255;  /* blue */
          }
        }
      }
      }

      writePPMImage( data, dimx, dimy, 3, "out_cpu.ppm" ); //assume this writes a ppm file
      delete [] data;


      return 0;
    }

Further, following the suggestion from @hyde, I have looped the computation-only part to generate 10,000 images. I am not writing all those images to disk, though; only the computation is being timed.

Here are the running times

  • CPU : more than 10 min and the code was still running
  • GPU : 1m 14.765s

Comments

  • You should certainly make a test which calculates, say, 1000 images instead of one. Maybe do that and update your question... And leave the saving out of the inner loop; calculate maybe 100 megabytes worth of images. - hyde
  • "And leave the saving out of the inner loop" - please clarify. - mkuse
  • I mean, if you want to measure the CPU vs GPU performance difference, then it's probably best not to do other things in the same loop... At least unless you can do the PPM saving in another thread while starting to calculate the next image at the same time, in which case you might get a benefit from the increased parallelism. - hyde
  • Following your suggestion, I have looped over the compute-only part. I guess I am satisfied that the GPU indeed works faster. Are my inferences correct? - mkuse
  • @mkuse: The GT 420M is a small chip -- it has only 48 cores. In comparison, the GTX 480 has 10 times as many cores. You probably end up not being able to compensate for the overhead of copying the data from device to host. Here are some benchmarks from a project I did which involves Mandelbrot calculations. - Roger Dahl

1 Answer


Turning the comments into an answer:

To get relevant figures, you need to calculate more than one image, so that the execution time is at least seconds or tens of seconds. Also, including the file-saving time in the results adds noise and hides the actual CPU vs GPU difference.
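
For example, on the CPU side something along these lines, so that only the computation is inside the timed region (a sketch; it assumes the pixel loops and writePPMImage from the question, and uses std::chrono simply as one way to take wall-clock time):

    #include <chrono>

    auto t0 = std::chrono::steady_clock::now();

    for( int z=0 ; z<1000 ; z++ )   //many images
    {
        //the two nested pixel loops from the question go here,
        //with no writePPMImage call inside
    }

    auto t1 = std::chrono::steady_clock::now();
    printf( "compute only: %f s\n",
            std::chrono::duration<double>( t1 - t0 ).count() );

    writePPMImage( data, dimx, dimy, 3, "out_cpu.ppm" );   //saving stays outside the timing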

Another way to get real results is to select a Julia set which has a lot of points belonging to the set, and then raise the iteration count so high that it takes many seconds to calculate just one image. Then there is only a single calculation setup, so this is likely to be the most advantageous scenario for GPU/CUDA.
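
For example, the iteration limit can be turned into a parameter instead of the hard-coded 100, so it is easy to raise (a sketch based on isJulia from the question; maxIter is a made-up name, and the squared comparison is just an equivalent way to write the sqrt test):

    __device__ bool isJulia( float x, float y, float maxX_2, float maxY_2, int maxIter )
    {
        float z_r = 0.8f * (maxX_2 - x) / maxX_2;
        float z_i = 0.8f * (maxY_2 - y) / maxY_2;

        float c_r = -0.8f;
        float c_i = 0.156f;
        for( int i=1 ; i<maxIter ; i++ )    //e.g. pass 100000 instead of 100
        {
            float tmp_r = z_r*z_r - z_i*z_i + c_r;
            float tmp_i = 2*z_r*z_i + c_i;

            z_r = tmp_r;
            z_i = tmp_i;

            if( z_r*z_r + z_i*z_i > 1000.0f*1000.0f )   //same test as sqrt(...) > 1000
                return false;
        }
        return true;
    }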

To measure how much overhead there is, change the image size to 1x1 and the iteration limit to 1, and then calculate enough images that it takes at least a few seconds. In this scenario, the GPU is likely significantly slower.
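
On the GPU side that boils down to timing a large number of tiny launches, roughly like this (a sketch; kernel and im_dev are from the question, and the iteration limit is assumed to be dropped to 1 as described above):

    const int N = 100000;                   //enough launches to take a few seconds

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    for( int z=0 ; z<N ; z++ )
    {
        kernel<<<1,1>>>( im_dev, 1, 1 );    //1x1 "image": almost pure launch overhead
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );           //launches are asynchronous; wait for them all

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "%f ms per launch\n", ms / N );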

To get the most relevant timings for your use case, select the image size and iteration count you are really going to use, and then measure at which image count both versions are equally fast. That will give you a rough rule of thumb for deciding which one to use when.
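
As a very rough sketch of that arithmetic, using the timings already posted in the question (estimates only, not measurements):

    double t_cpu = 0.052;                 //seconds per image on the CPU (single-image run)
    double t_gpu = 74.765 / 10000.0;      //seconds per image on the GPU (~0.0075 s)
    double t_fix = 0.784 - t_gpu;         //one-off GPU cost from the single-image run
                                          //(startup, allocation, copy, file write)
    double breakeven = t_fix / (t_cpu - t_gpu);   //~17 images in this case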

An alternative approach for practical results, if you are only going to generate a single image: find the iteration limit for a single worst-case image at which the CPU and GPU are equally fast. If that many or more iterations would be advantageous, choose the GPU; otherwise choose the CPU.