3 votes

I am having trouble figuring out why my CUDA code runs slower than my CPU code.

My desktop configuration is an i7 2600S with a GeForce 560 Ti, and my code is as follows:

int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
    //time measurement
    float elapsed_time_ms = 0;
    cudaEvent_t start, stop;

    //threads per block
    dim3 threads(16,16);
    //blocks
    dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);

    int *device_Seam;
    int *host_Seam;
    int seamSize;

    if(direction == 1)
    {
        seamSize = height*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<height;i++)
            host_Seam[i] = seam[i];
    }
    else
    {
        seamSize = width*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<width;i++)
            host_Seam[i] = seam[i];
    }

    cudaMalloc((void**)&device_Seam, seamSize);
    cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);

    global_host_MC = MCEnergyMat;
    new_host_MC = newE;

    //copy host arrays to device, one row at a time
    cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);

    cudaMemcpy(new_MC, new_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    //do some operations on the 2d matrix
    gpu_shiftSeam<<< blocks,threads >>>(global_MC, new_MC, device_Seam, width, height);

    //measure end time for gpu calculations
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    execTime += elapsed_time_ms;

    //copy the result back to the host
    for(int i=0;i<width;i++)
    {
        cudaMemcpy(newE[i], new_MC2[i], sizeof(int)*height, cudaMemcpyDeviceToHost);
    }

    return newE;
}

I looped it 800 times and got the following results:

GPU computation time (the gpu_shiftSeam part): 1176 ms; total program run time: 22 s

CPU computation time (same operation as gpu_shiftSeam but on the host): 12522 ms; total program run time: 12 s

Apparently the GPU computation time is way shorter than the CPU time, but for some reason the total program run time for the GPU version is a lot longer. Does anyone know why? Is it because the number of threads/blocks I am assigning is incorrect? Or does the slowness come from allocating memory on the device?

Thanks a lot!

Just move the timers around, or create more timers, so that you can see where the time goes. Maybe the time is being spent in the cudaMemcpy() calls. - Roger Dahl
What if the time is spent in the cudaMemcpy() calls like you said? Does that mean it is inevitable to spend this much time in that function? Because I don't think there's an alternative to cudaMemcpy(). - user1360113
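One way to check, following Roger Dahl's suggestion, is to bracket the copies with their own CUDA events. A minimal sketch using names from the question's code (copyStart, copyStop, and copyMs are illustrative):

//time just the host-to-device copies, separately from the kernel
cudaEvent_t copyStart, copyStop;
float copyMs = 0;
cudaEventCreate(&copyStart);
cudaEventCreate(&copyStop);

cudaEventRecord(copyStart, 0);
cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);
//... the row-by-row cudaMemcpy loops go here ...
cudaEventRecord(copyStop, 0);
cudaEventSynchronize(copyStop);
cudaEventElapsedTime(&copyMs, copyStart, copyStop);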
Before optimizing I would time the subsections of kernel_shiftSeam or use one of the profilers (Parallel Nsight, CUDA profiler, NVIDIA Visual Profiler). - Greg Smith
Items that are causing overhead include but are not limited to: (a) CPU memory copy to fix direction, (b) malloc, (c) cudaMalloc, (d) CPU memory copy in cudaMemcpy due to copying non-pinned memory to the GPU, (e) blocking call to cudaMemcpy, (f) unnecessary cudaEventSynchronize (move it to the end), and (g) division of the cudaMemcpy into many calls. You can look into using cudaMallocPitch/cudaMemcpy2D to handle 2D copies. You are also leaking both CPU and GPU memory, which on WDDM will slow down each successive GPU operation. - Greg Smith
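For reference, a minimal sketch of the cudaMallocPitch/cudaMemcpy2D approach Greg Smith mentions, assuming the host matrix is stored as one contiguous block of width rows by height ints (dev_MC and host_data are illustrative names, not from the question's code):

int *dev_MC;
size_t pitch; //row stride in bytes, chosen by the driver for alignment
cudaMallocPitch((void**)&dev_MC, &pitch, height * sizeof(int), width);

//one 2D copy replaces the per-row loop of cudaMemcpy calls
cudaMemcpy2D(dev_MC, pitch,
             host_data, height * sizeof(int), //source rows are tightly packed
             height * sizeof(int), width,
             cudaMemcpyHostToDevice);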

2 Answers

2 votes

In my experience, memory accesses are the #1 reason for slowness.

Profile your array copies to see how much time is being spent. If it is a considerable amount, perhaps try optimizing your code. Instead of copying inside a for-loop, see if you can copy sizeof(int) * height * width bytes in a single call. Reducing the number of times you call cudaMemcpy should help.

//copy the pointer table, then the whole matrix in one transfer
//(assumes the host matrix is one contiguous allocation)
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
cudaMemcpy(global_MC2, global_host_MC, sizeof(int)*height*width, cudaMemcpyHostToDevice);
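If the rows were malloc'd one at a time, a single flattened copy like the one above only works once the matrix lives in one contiguous allocation. A sketch of that layout, with illustrative names (host_data, host_rows, and dev_data are not from the original code):

//allocate one contiguous block plus a row-pointer index into it
int *host_data = (int*)malloc(sizeof(int) * width * height);
int **host_rows = (int**)malloc(sizeof(int*) * width);
for(int i = 0; i < width; i++)
    host_rows[i] = host_data + i * height; //row i starts here

int *dev_data;
cudaMalloc((void**)&dev_data, sizeof(int) * width * height);

//one transfer instead of 'width' separate ones
cudaMemcpy(dev_data, host_data, sizeof(int) * width * height, cudaMemcpyHostToDevice);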
0 votes

I had a similar experience and found that cudaMalloc was the bottleneck while cudaMemcpy wasn't. On my device, I remember that a 16 MB allocation took 160 ms. CUDA memory allocation, however, can be done before the actual computation, for example in a separate setup call. The allocation time can then be excluded from the overall performance measure (e.g., the speedup), although I would still include the cudaMemcpy operations in the speedup calculation.
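As a sketch of what that looks like, assuming the buffer size is known up front (setup, teardown, and dev_buf are illustrative names):

static int *dev_buf = NULL;

void setup(size_t bytes)   //call once, before the timed loop
{
    cudaMalloc((void**)&dev_buf, bytes);
}

void teardown(void)        //call once, after the timed loop
{
    cudaFree(dev_buf);
}

With the allocation hoisted out, each iteration only pays for the copies and the kernel launch.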