3 votes

I am having trouble figuring out why my CUDA code runs slower than my CPU code.

My desktop configuration is an i7 2600S with a GeForce 560 Ti, and my code is as follows:

int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
    //time measurement
    float elapsed_time_ms = 0;
    cudaEvent_t start, stop;

    //threads per block
    dim3 threads(16,16);
    //blocks
    dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);

    int *device_Seam;
    int *host_Seam;
    int seamSize;

    if(direction == 1)
    {
        seamSize = height*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<height;i++)
            host_Seam[i] = seam[i];
    }
    else
    {
        seamSize = width*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<width;i++)
            host_Seam[i] = seam[i];
    }

    cudaMalloc((void**)&device_Seam, seamSize);
    cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);

    global_host_MC = MCEnergyMat;
    new_host_MC = newE;

    //copy host arrays to device, one row at a time
    cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);

    cudaMemcpy(new_MC, new_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    //do some operations on the 2d matrix
    gpu_shiftSeam<<< blocks,threads >>>(global_MC, new_MC, device_Seam, width, height);

    //measure end time for gpu calculations
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    execTime += elapsed_time_ms;

    //copy the result back to the host
    for(int i=0;i<width;i++)
    {
        cudaMemcpy(newE[i], new_MC2[i], sizeof(int)*height, cudaMemcpyDeviceToHost);
    }

    return newE;
}

I looped it 800 times and got the following results:

GPU computation time (the gpu_shiftSeam part): 1176 ms; total program run time: 22 s

CPU computation time (same operation as gpu_shiftSeam but on the host): 12522 ms; total program run time: 12 s

Apparently the GPU computation time is way shorter than the CPU time, but for some reason the total program run time for the GPU version is a lot longer. Does anyone know why? Is it because the number of threads/blocks I am assigning is incorrect? Or does the slowness come from allocating memory on the device?

Thanks a lot!

Just move the timers around, or create more timers, so that you can see where the time goes. Maybe the time is being spent in the cudaMemcpy() calls. - Roger Dahl
What if the time is spent in the cudaMemcpy() calls like you said? Does that mean it is inevitable to spend this much time in that function? Because I don't think there's an alternative to cudaMemcpy(). - user1360113
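One way to check, following Roger Dahl's suggestion, is to bracket the copies with their own CUDA events. A minimal sketch using names from the question's code (copyStart, copyStop, and copyMs are illustrative):

//time just the host-to-device copies, separately from the kernel
cudaEvent_t copyStart, copyStop;
float copyMs = 0;
cudaEventCreate(&copyStart);
cudaEventCreate(&copyStop);

cudaEventRecord(copyStart, 0);
cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);
//... the row-by-row cudaMemcpy loops go here ...
cudaEventRecord(copyStop, 0);
cudaEventSynchronize(copyStop);
cudaEventElapsedTime(&copyMs, copyStart, copyStop);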
Before optimizing I would time the subsections of kernel_shiftSeam or use one of the profilers (Parallel Nsight, CUDA profiler, NVIDIA Visual Profiler). - Greg Smith
Items that are causing overhead include but are not limited to: (a) CPU memory copy to fix direction, (b) malloc, (c) cudaMalloc, (d) CPU memory copy in cudaMemcpy due to copying non-pinned memory to the GPU, (e) blocking call to cudaMemcpy, (f) unnecessary cudaEventSynchronize (move it to the end), and (g) division of the cudaMemcpy into many calls. You can look into using cudaMallocPitch/cudaMemcpy2D to handle 2D copies. You are also leaking both CPU and GPU memory, which on WDDM will slow down each successive GPU operation. - Greg Smith
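For reference, a minimal sketch of the cudaMallocPitch/cudaMemcpy2D approach Greg Smith mentions, assuming the host matrix is stored as one contiguous block of width rows by height ints (dev_MC and host_data are illustrative names, not from the question's code):

int *dev_MC;
size_t pitch; //row stride in bytes, chosen by the driver for alignment
cudaMallocPitch((void**)&dev_MC, &pitch, height * sizeof(int), width);

//one 2D copy replaces the per-row loop of cudaMemcpy calls
cudaMemcpy2D(dev_MC, pitch,
             host_data, height * sizeof(int), //source rows are tightly packed
             height * sizeof(int), width,
             cudaMemcpyHostToDevice);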

2 Answers

2 votes

In my experience, memory accesses are the #1 reason for slowness.

Profile your array copies to see how much time is being spent. If it is a considerable amount, perhaps try optimizing your code. Instead of copying inside a for-loop, see if you can copy sizeof(int) * height * width bytes in a single call. Reducing the number of times you call cudaMemcpy should help.

//copy the pointer table, then the whole matrix in one transfer
//(assumes the host matrix is one contiguous allocation)
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
cudaMemcpy(global_MC2, global_host_MC, sizeof(int)*height*width, cudaMemcpyHostToDevice);
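If the rows were malloc'd one at a time, a single flattened copy like the one above only works once the matrix lives in one contiguous allocation. A sketch of that layout, with illustrative names (host_data, host_rows, and dev_data are not from the original code):

//allocate one contiguous block plus a row-pointer index into it
int *host_data = (int*)malloc(sizeof(int) * width * height);
int **host_rows = (int**)malloc(sizeof(int*) * width);
for(int i = 0; i < width; i++)
    host_rows[i] = host_data + i * height; //row i starts here

int *dev_data;
cudaMalloc((void**)&dev_data, sizeof(int) * width * height);

//one transfer instead of 'width' separate ones
cudaMemcpy(dev_data, host_data, sizeof(int) * width * height, cudaMemcpyHostToDevice);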
0 votes

I had a similar experience and found that cudaMalloc was the bottleneck while cudaMemcpy wasn't. On my device, I remember that a 16 MB allocation took 160 ms. CUDA memory allocation, however, can be done before the actual computation, for example in a separate setup call. The allocation time can then be excluded from the overall performance measure (e.g., the speedup), although I would still include the cudaMemcpy operations in the speedup calculation.
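As a sketch of what that looks like, assuming the buffer size is known up front (setup, teardown, and dev_buf are illustrative names):

static int *dev_buf = NULL;

void setup(size_t bytes)   //call once, before the timed loop
{
    cudaMalloc((void**)&dev_buf, bytes);
}

void teardown(void)        //call once, after the timed loop
{
    cudaFree(dev_buf);
}

With the allocation hoisted out, each iteration only pays for the copies and the kernel launch.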