0
votes

Hi everyone, I'm currently working on timing some of my CUDA code. I was able to time it using events. My kernel ran for 19 ms. Somehow I find this doubtful, because a sequential implementation of the same algorithm took around 5000 ms. I know the code should run faster, but should it be this much faster?

I'm using wrapper functions to call the CUDA kernels from my .cpp program. Am I supposed to be calling them there or in the .cu file? Thanks!

3
Speedups of 100x are not very surprising with CUDA. But you should post some code so we can see what you're doing! – user703016
Did you use streams? Did you add cudaDeviceSynchronize() after the kernel call and before the time measurement, in case you are using the default stream? – geek
Since the OP is using events, the OP should use cudaEventSynchronize(), not cudaDeviceSynchronize() (the latter will work, but it's a bit of a heavy hammer for timing...). – harrism
How are you calling the CUDA kernels from your .cpp file? If you are not using <<<>>>, the CUDA Driver API, or cudaLaunch(), then you are not launching kernels on the device. Posting some example code would help us answer. – harrism
Another thing to check is how you are measuring the time of your sequential version. Some unfair comparisons measure the full run time of the sequential code from the console, including the time spent in malloc and free, or even worse, the time spent reading input data from a file. Comparing that against the CUDA kernel alone (which does not include cudaMalloc, cudaFree, or CPU-GPU data transfers) results in impressively inflated speedups. – pQB

3 Answers

1
votes

The obvious way to check if your program is working would be to compare the output to that of your CPU-based implementation. If you get the same output, it is working by definition, right? :)
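A minimal verification sketch of that comparison (the names h_cpu, h_gpu, n, and eps are hypothetical placeholders):

#include <math.h>

/* Return 1 if the GPU results match the CPU reference within tolerance eps. */
int results_match( const float *h_cpu , const float *h_gpu , int n , float eps ) {
    for ( int i = 0 ; i < n ; i++ )
        if ( fabsf( h_cpu[i] - h_gpu[i] ) > eps )
            return 0;  /* mismatch at element i */
    return 1;
}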

If your program is experimental in the sense that it doesn't produce any verifiable output, then there is a good chance that the compiler has optimized away some (or all) of your code. The compiler removes code that does not contribute to output data. This can cause, for instance, the entire contents of a kernel to be removed if the final statement that stores the computed value is commented out.
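As a minimal illustration (the kernel name and parameters are hypothetical), commenting out the final store below leaves nothing observable, so the compiler is free to discard the entire loop:

__global__ void sum_kernel ( const float *in , float *out , int n ) {
    float acc = 0.0f;
    for ( int i = 0 ; i < n ; i++ )
        acc += in[i];
    out[ threadIdx.x ] = acc;  /* comment this out and the loop above becomes dead code */
}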

As to your speedup: 5000 ms / 19 ms ≈ 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.

0
votes

Well, if you wrote your CUDA code correctly, yes, it could be that much faster. Think about it: you moved the code from sequential execution on a single processor to parallel execution on hundreds of processors, depending on your GPU model. My $179 mid-range card has 480 cores; some cards available now have 1500 cores. It is very possible to get 100x performance jumps with CUDA, particularly if your kernel is much more compute-bound than memory-bound.

That said, make sure you are measuring what you think you are measuring. Kernel launches are asynchronous with respect to the host thread, whether or not you use explicit streams, so before reading a host-side timer you need to call cudaDeviceSynchronize() or have your host code wait on an event recorded after the kernel. Otherwise your measurement reflects only the launch overhead, not the kernel's run time. You can also use CUDA events to measure elapsed time on the GPU within a given stream. See section 5.1.2 of the CUDA Best Practices Guide in the NVIDIA GPU Computing SDK 4.2.
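For example, a minimal event-based timing sketch (mykernel, grid, block, and the kernel arguments are placeholders):

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start , 0 );                       /* record in the default stream */
mykernel<<< grid , block >>>( d_in , d_out , n );   /* hypothetical kernel launch */
cudaEventRecord( stop , 0 );

cudaEventSynchronize( stop );                       /* wait for the kernel and stop event to complete */
float ms = 0.0f;
cudaEventElapsedTime( &ms , start , stop );         /* elapsed GPU time in milliseconds */

cudaEventDestroy( start );
cudaEventDestroy( stop );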

-3
votes

In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:

/* IDs for the per-kernel timers, and the global array that accumulates them. */
enum {
    tid_this = 0,
    tid_that,
    tid_count
};
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 /* Record the start tick in thread 0 of each block. */
 #define TIMER_TIC \
    clock_t tic; \
    if ( threadIdx.x == 0 ) \
        tic = clock();
 /* Accumulate this block's elapsed ticks into cuda_timers[tid],
    handling a single wrap-around of the 32-bit clock counter. */
 #define TIMER_TOC(tid) \
    clock_t toc = clock(); \
    if ( threadIdx.x == 0 ) \
        atomicAdd( &cuda_timers[tid] , \
                   (float)( ( toc > tic ) ? ( toc - tic ) : ( toc + ( 0xffffffff - tic ) ) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
    TIMER_TOC( tid_this );

}

You can then read the cuda_timers in the host code.
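A minimal host-side readout sketch (assuming the enum and cuda_timers above are in scope); cudaMemcpyFromSymbol copies the __device__ array back to the host:

float timers[ tid_count ];
cudaMemcpyFromSymbol( timers , cuda_timers , sizeof(float) * tid_count );
printf( "tid_this took %.0f ticks (summed over all blocks)\n" , timers[ tid_this ] );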

A few notes:

  • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
  • The timers count clock ticks. To convert ticks to milliseconds, divide the tick count by the device's clock rate in kHz (e.g. the clockRate field of cudaDeviceProp, which is reported in kHz); see the conversion sketch after this list.
  • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
  • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
  • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
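As promised above, a minimal ticks-to-milliseconds conversion sketch (assuming the timer values have already been copied into a host array named timers, as in the readout snippet):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );                   /* device 0; prop.clockRate is in kHz */
float ms = timers[ tid_this ] / (float)prop.clockRate;  /* ticks / kHz = milliseconds */
printf( "tid_this: %.3f ms\n" , ms );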