Optimize Cuda Kernel time execution

Question

I'm a learning Cuda student, and I would like to optimize the execution time of my kernel function. As a result, I realized a short program computing the difference between two pictures. So I compared the execution time between a classic CPU execution in C, and a GPU execution in Cuda C.

Here you can find the code I'm talking about:

int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch(computing_type)
{

    case GPU:

    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice)); 
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;

    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for(int m = 0; m < nb_loops ; m++)
    {
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    }

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate:  %4.4f ms \n", time/nb_loops);

    break;

    case CPU:

    clock_t begin = clock(), diff;

    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
            {
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
            }
        }
    }
    diff = clock() - begin;

    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);

    break;
}

And here is my kernel function:

__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}

I obtained these execution time for each one

CPU: 1,3210ms
GPU: 0,3229ms

I wonder why GPU result is not as lower as it should be. I am a beginner in Cuda so please be comprehensive if there are some classic errors.

EDIT1: Thank you for your feedback. I tried to delete the 'if' condition from the kernel but it didn't change deeply my program execution time.

However, after having install Cuda profiler, it told me that my threads weren't running concurrently. I don't understand why I have this kind of message, but it seems true because I only have a 5 or 6 times faster application with GPU than with CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently to all the other ones. If you have an idea of what I am doing wrong, it would be hepful...

Flow.

So your GPU result is 4x faster than your CPU result? What were you expecting? — Robert Crovella
How many loops are you running? There's significant overhead when copying to/from the GPU. — 3Dave

u354356007 u354356007 · Accepted Answer · 2016-07-15T15:36:24

Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are considered not optimal:

if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}

Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:

Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.

Obviously you'll have modify your code that calls the kernels.

And on a separate note:

GPU has throughput-oriented architecture, but not latency-oriented as CPU. It means CPU may be faster then CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
CUDA Profiler is a very handy tool that will tell you're not optimal in the code.

Optimize Cuda Kernel time execution

2 Answers