Basically, what you're measuring as your CPU time is the time it takes to
- record the first event,
- set up the kernel launch with the respective parameters,
- send the necessary commands to the GPU,
- launch the kernel on the GPU,
- execute the kernel on the GPU,
- wait for the notification that GPU execution finished to get back to the CPU, and
- record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread but, rather, the total wall-clock time elapsed (which potentially includes time spent by other processes/threads while your process/thread was not even running). I have to admit that, even in light of all that, the CPU time you report is still much larger relative to the GPU time than I would normally expect. But I'm not sure that what you posted really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be additional factors we're not aware of that would have to be considered to fully explain your timings.
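To illustrate the difference between elapsed wall-clock time and the processor time actually spent by your process, here is a minimal sketch. The names `Elapsed` and `measure_sleep` are made up for this example, and it assumes a platform where std::clock() reports process CPU time (as on Linux; on Windows it reports wall-clock time instead):

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical result type for this illustration.
struct Elapsed {
    double wall_ms;  // real time that passed
    double cpu_ms;   // processor time charged to this process
};

// Hypothetical helper: sleeps for d and reports how much wall-clock time
// and how much process CPU time went by in the meantime.
Elapsed measure_sleep(std::chrono::milliseconds d) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::this_thread::sleep_for(d);  // the process is idle while it sleeps

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Elapsed e;
    e.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    e.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return e;
}
```

While the thread sleeps, the wall clock keeps running but almost no CPU time is charged to the process. A thread blocked waiting for the GPU to finish behaves much the same way.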
Anyways, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the interval between the start and end events will cover the kernel execution, the CPU waiting for kernel execution to finish, and whatever time it takes to then record the second event and have it reach the GPU, just so you can ask the GPU at what time it received it. Think of events as markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>(); // third launch parameter is the dynamic shared memory size, the stream is the fourth
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::chrono::steady_clock or std::chrono::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
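A minimal sketch of host-side timing with std::chrono::steady_clock might look like the following. The names `Clock`, `tick_ns`, and `time_ms` are made up for this example. Note that the clock period only gives the nominal tick length of the clock's time representation; the resolution you actually get may be coarser.

```cpp
#include <chrono>

// Prefer steady_clock: it is guaranteed to be monotonic.
using Clock = std::chrono::steady_clock;
static_assert(Clock::is_steady, "timing clock must be steady");

// Nominal length of one clock tick in nanoseconds, taken from the clock's period.
constexpr double tick_ns() {
    return 1e9 * static_cast<double>(Clock::period::num) /
           static_cast<double>(Clock::period::den);
}

// Hypothetical helper: runs f once and returns the elapsed wall-clock time
// in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = Clock::now();
    f();
    const auto stop = Clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

If you wrap a kernel launch in something like `time_ms`, keep in mind that launches are asynchronous: the callable would also have to wait for the GPU to finish, and the result would then include launch and synchronization overhead rather than just the kernel's execution time.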