8
votes

What is the difference between using a CPU timer and the CUDA timer event to measure the time taken for the execution of some CUDA code? Which of these should a CUDA programmer use and why?

Using a CPU timer involves calling cudaThreadSynchronize (cudaDeviceSynchronize in current CUDA releases) before each time reading, since kernel launches are asynchronous with respect to the host. The time itself can be read with clock(), or with a high-resolution performance counter such as QueryPerformanceCounter (on Windows).
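As a rough sketch of the CPU-timer approach (the kernel `myKernel` is a made-up placeholder, and std::chrono stands in for clock()/QueryPerformanceCounter as a portable high-resolution host clock):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to time.
__global__ void myKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Kernel launches are asynchronous, so synchronize before reading
    // the host clock on both sides of the timed region.
    cudaDeviceSynchronize();  // replaces the older cudaThreadSynchronize
    auto start = std::chrono::high_resolution_clock::now();

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaDeviceSynchronize();  // wait until the kernel has actually finished
    auto stop = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    printf("Host-timed: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}
```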

Using CUDA events involves recording an event before and one after the code of interest with cudaEventRecord. The elapsed time is then obtained by calling cudaEventSynchronize on the stop event, followed by cudaEventElapsedTime.
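The event-based approach can be sketched as follows (again with a placeholder kernel `myKernel`; the NULL stream is used for the recordings):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the code being timed.
__global__ void myKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // record in the NULL stream
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);  // block until 'stop' has been reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Event-timed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```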

2
Did you start writing one question and finish writing another? I don't understand how the last paragraph fits in with the rest of the question. What is it that you really want to know? Are you attempting to reconcile the output from host and device timer measurements and can't, or something else? – talonmies
Talonmies: I have removed the last paragraph. So the question simply is: as a programmer, I am confused about which of these 2 timers to use, and why. – Ashwin Nanjappa

2 Answers

9
votes

The answer to the first part of the question is that cudaEvent timers are based on high-resolution counters on board the GPU, and they have lower latency and better resolution than a host timer because they come "off the metal". You should expect sub-microsecond resolution from cudaEvent timers, and you should prefer them for timing GPU operations for precisely that reason. The per-stream nature of cudaEvents can also be useful for instrumenting asynchronous operations such as simultaneous kernel execution and overlapped copy and kernel execution. That sort of time measurement is just about impossible with host timers.
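A sketch of the per-stream instrumentation the answer alludes to, timing an async copy and a kernel that run concurrently in separate streams (all identifiers here are illustrative; the copy requires pinned host memory to actually overlap):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel running concurrently with the copy.
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_a, *d_a, *d_b;
    cudaHostAlloc(&h_a, n * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t c0, c1, k0, k1;
    cudaEventCreate(&c0); cudaEventCreate(&c1);
    cudaEventCreate(&k0); cudaEventCreate(&k1);

    // Copy in stream s0 while the kernel runs in stream s1; each event
    // pair brackets only the work enqueued in its own stream.
    cudaEventRecord(c0, s0);
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    cudaEventRecord(c1, s0);

    cudaEventRecord(k0, s1);
    myKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);
    cudaEventRecord(k1, s1);

    cudaDeviceSynchronize();
    float copyMs = 0.0f, kernMs = 0.0f;
    cudaEventElapsedTime(&copyMs, c0, c1);
    cudaEventElapsedTime(&kernMs, k0, k1);
    printf("copy: %.3f ms, kernel: %.3f ms (overlapped)\n", copyMs, kernMs);

    cudaEventDestroy(c0); cudaEventDestroy(c1);
    cudaEventDestroy(k0); cudaEventDestroy(k1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```

A host timer wrapped around this whole region could only report the combined wall time, not the per-stream breakdown.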

EDIT: I won't answer the last paragraph because you deleted it.

3
votes

The main advantage of using CUDA events for timing is that they are less subject to perturbation by other system events, such as paging or interrupts from the disk or network controller. Also, because cu(da)EventRecord is asynchronous, there is less of a Heisenberg effect when timing short, GPU-intensive operations.

Another advantage of CUDA events is that they have a clean cross-platform API - no need to wrap gettimeofday() or QueryPerformanceCounter().

One final note: use caution when using streamed CUDA events for timing - if you do not specify the NULL stream, you may wind up timing operations that you did not intend to. There is a good analogy between CUDA events and reading the CPU's timestamp counter, which is a serializing instruction. On modern superscalar processors, the serializing semantics make the timing unambiguous. Also like RDTSC, you should always bracket the events you want to time with enough work that the timing is meaningful (just like you can't use RDTSC to meaningfully time a single machine instruction).
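The NULL-stream caution can be illustrated with a fragment like this (a hedged sketch; `kernelOfInterest`, `grid`, `block`, and the stream `s` are assumed to exist in the surrounding code):

```cuda
// Events recorded in the NULL stream order against work in all
// (non-cudaStreamNonBlocking) streams, so this interval is unambiguous
// but may include concurrent work from other streams:
cudaEventRecord(start, 0);
kernelOfInterest<<<grid, block, 0, s>>>();
cudaEventRecord(stop, 0);

// Events recorded in stream s only order against stream s, so this
// interval brackets just that stream's work -- but with other streams
// active, make sure that is really what you intend to measure:
cudaEventRecord(start, s);
kernelOfInterest<<<grid, block, 0, s>>>();
cudaEventRecord(stop, s);
```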