
I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program that computes the difference between two pictures, and I compared the execution time of a classic CPU implementation in C with a GPU implementation in CUDA C.

Here is the code I'm talking about:

int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;


    case GPU:

    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice)); 
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;

    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for(int m = 0; m < nb_loops ; m++)
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate:  %4.4f ms \n", time/nb_loops);


    case CPU:

    clock_t begin = clock(), diff;

    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
        }
    }
    diff = clock() - begin;

    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);


And here is my kernel function:

__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
        data_res[v] = (int) data2[v] - (int) data1[v];
}

I obtained these execution times:

  • CPU: 1.3210 ms
  • GPU: 0.3229 ms

I wonder why the GPU time is not as low as it should be. I am a beginner in CUDA, so please be understanding if I have made some classic mistakes.

EDIT1: Thank you for your feedback. I tried removing the 'if' condition from the kernel, but it did not significantly change my program's execution time.

However, after I installed the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this message, but it seems to be true, because my application is only 5 or 6 times faster on the GPU than on the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...


There are probably other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:

if (row < MAX_H && col < MAX_W)
    data_res[v] = (int) data2[v] - (int) data1[v];

Conditional statements inside a kernel can cause warp divergence: when threads of the same warp take different branches, the if and else parts are executed one after the other, not in parallel. Also, as you might have realized, the condition evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image into two parts:

  1. The central part, where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area; the if is unnecessary there.

  2. Border areas that will use your diff kernel.

Obviously, you'll have to modify the code that calls the kernels.
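As a rough sketch of that split, assuming the image is stored as one contiguous array of n pixels and addressed with a flat index (rather than the row/column scheme in the question), the unguarded kernel can cover all the full blocks and a guarded kernel can handle the remainder. The names diff_interior, diff_tail and launch_split_diff are hypothetical, and whether the split gains anything measurable is something to check with the profiler.

```cuda
#include <cuda_runtime.h>

// Unguarded kernel: only launched over blocks that lie fully inside the image.
__global__ void diff_interior(const unsigned char *data1, const unsigned char *data2,
                              int *data_res)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    data_res[v] = (int) data2[v] - (int) data1[v];
}

// Guarded kernel: handles the last, partially filled block.
__global__ void diff_tail(const unsigned char *data1, const unsigned char *data2,
                          int *data_res, int offset, int n)
{
    int v = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n)
        data_res[v] = (int) data2[v] - (int) data1[v];
}

// d1, d2 and dres are device pointers; n is the number of pixels.
void launch_split_diff(const unsigned char *d1, const unsigned char *d2,
                       int *dres, int n)
{
    const int threads = 128;           // assumed block size
    int full_blocks = n / threads;     // blocks that need no bounds check
    int covered = full_blocks * threads;

    if (full_blocks > 0)
        diff_interior<<<full_blocks, threads>>>(d1, d2, dres);
    if (covered < n)
        diff_tail<<<1, threads>>>(d1, d2, dres, covered, n);
}
```

Both launches go to the default stream, so a later cudaMemcpy back to the host sees the results of both kernels.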

And on a separate note:

  1. A GPU has a throughput-oriented architecture, whereas a CPU is latency-oriented. This means a CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?

  2. The CUDA Profiler is a very handy tool that will tell you where your code is not optimal.


I don't think you are measuring the times correctly: the memory copy is a time-consuming step on the GPU, and you should take it into account when measuring your time.
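One way to see how much the copies contribute is to record extra events around them. The following is only a sketch under a few assumptions: the function name time_diff_with_copies and the kernel name diff_timed are made up for illustration, the image is treated as a flat array of n pixels, and error checking is left out for brevity (the question's HANDLE_ERROR macro could wrap each call).

```cuda
#include <cuda_runtime.h>

// Same computation as the kernel in the question, with the size passed in.
__global__ void diff_timed(const unsigned char *data1, const unsigned char *data2,
                           int *data_res, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n)
        data_res[v] = (int) data2[v] - (int) data1[v];
}

// Returns the total time in ms (copies + kernel);
// *kernel_ms receives the kernel-only time.
float time_diff_with_copies(const unsigned char *h1, const unsigned char *h2,
                            int *hres, int n, float *kernel_ms)
{
    unsigned char *d1, *d2;
    int *dres;
    cudaMalloc((void**)&d1, n);
    cudaMalloc((void**)&d2, n);
    cudaMalloc((void**)&dres, n * sizeof(int));

    cudaEvent_t start, k_start, k_stop, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&k_start);
    cudaEventCreate(&k_stop);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // before host-to-device copies
    cudaMemcpy(d1, h1, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2, n, cudaMemcpyHostToDevice);
    cudaEventRecord(k_start, 0);                     // just before the kernel
    diff_timed<<<(n + 127) / 128, 128>>>(d1, d2, dres, n);
    cudaEventRecord(k_stop, 0);                      // just after the kernel
    cudaMemcpy(hres, dres, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);                        // after device-to-host copy
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(kernel_ms, k_start, k_stop);
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(k_start);
    cudaEventDestroy(k_stop);
    cudaEventDestroy(stop);
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dres);
    return total_ms;
}
```

Comparing the returned total with *kernel_ms shows how much of the GPU time is spent moving data rather than computing, which is the fair figure to set against the CPU loop when the images start out in host memory.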

I see some details that you can test:

  1. I suppose you are using MAX_H and MAX_W as constants; you may consider placing them in constant memory using cudaMemcpyToSymbol().

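  2. Remember that __syncthreads() only synchronizes the threads of one block, so use it if threads within a block ever share data. Successive kernel launches in the same stream already run in order, so there is no overlap between loop iterations.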

  3. CUDA schedules threads in warps of 32, so the number of threads per block works best as a multiple of 32, and it must not exceed what your hardware supports (512 or 1024 threads per block, depending on the compute capability). Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.

  4. Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.

  5. In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .

  6. Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, because of padding.
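Several of these points can be combined in one small sketch. It assumes a tightly packed image (no row padding, so plain cudaMalloc and cudaMemcpy are enough), stores the total pixel count in constant memory instead of MAX_H and MAX_W separately, and uses the hypothetical names c_size, diff_flat and run_diff_flat. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Pixel count in constant memory, set from the host with cudaMemcpyToSymbol().
__constant__ int c_size;

__global__ void diff_flat(const unsigned char *data1, const unsigned char *data2,
                          int *data_res)
{
    // A single flat index replaces the separate row/col variables.
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    if (v < c_size)
        data_res[v] = (int) data2[v] - (int) data1[v];
}

// Runs the kernel on host images h1 and h2, writes the result to hres,
// and returns the number of pixels that differ from a CPU reference.
int run_diff_flat(const unsigned char *h1, const unsigned char *h2,
                  int *hres, int cols, int rows)
{
    int n = cols * rows;
    unsigned char *d1, *d2;
    int *dres;
    cudaMalloc((void**)&d1, n);
    cudaMalloc((void**)&d2, n);
    cudaMalloc((void**)&dres, n * sizeof(int));

    cudaMemcpy(d1, h1, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2, n, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_size, &n, sizeof(int));

    // 128 threads per block (a multiple of the warp size), rounded-up grid.
    // Note: older devices limit gridDim.x to 65535 blocks.
    diff_flat<<<(n + 127) / 128, 128>>>(d1, d2, dres);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hres, dres, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Free the device memory once the result is back on the host.
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dres);

    // Check the GPU result against the CPU computation.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (hres[i] != (int) h2[i] - (int) h1[i])
            mismatches++;
    return mismatches;
}
```

A return value of 0 means the GPU output matches the CPU loop for every pixel. If the rows were padded, the flat index would no longer be valid and the pitched allocation and copy functions would be needed instead.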