0 votes

I try to measure the time taken by a CUDA kernel function. I measure both CPU and GPU timings of it. But I'm getting a huge difference between both.

When I profile it using the NVIDIA profiler, the kernel takes around 6ms, which is what I want. But when I use gettimeofday() around the kernel call to get the CPU timing, the measure is 15ms. There are no memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.

Sample code :

gettimeofday(&start, NULL);
cudaEventRecord(startGPU);

Kernel<<<abc, xyz, 0, stream>>>();
cudaDeviceSynchronize();

cudaEventRecord(stopGPU);
printf("Elapsed GPU time  = ");

gettimeofday(&stop, NULL);
printf("Elapsed CPU time  = ");

Results I'm getting for the above code:

Elapsed GPU time = 6ms
Elapsed CPU time = 15ms

It is weird because only the kernel launch is present there. The kernel parameters are, however, pointers. Is the extra time being taken by memory copies? But I do not find any memory copies in the profile either. Any leads would be appreciated.

What platform? If Windows with the WDDM driver, the CPU time will additionally include one batch timeout. If the GPU also drives the display, there might be additional delays involved while the GPU is updating the screen. – tera

Yes, I was aware of this. But I am using an NVIDIA Jetson TX2. – Madhu Soodhan

2 Answers

3 votes

Basically, what you're measuring as your CPU time is the time it takes to

  1. record the first event,
  2. set up the kernel launch with the respective parameters,
  3. send the necessary commands to the GPU,
  4. launch the kernel on the GPU,
  5. execute the kernel on the GPU,
  6. wait for the notification that GPU execution finished to get back to the CPU, and
  7. record the second event.

Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but rather the total wall-clock time elapsed (which potentially includes time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the CPU time you report is still much larger relative to the GPU time than I would normally expect. But I'm not sure that the code above really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't actually print the timings. So there may be some additional factors we're not aware of that would have to be considered to fully explain your measurements.

Anyway, most likely neither of the two measurements you take actually measures what you really want to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the interval between the start and end events will include the kernel execution, the CPU's wait for kernel execution to finish, and whatever time it then takes to record the second event and have it reach the GPU just so you can ask the GPU at what time it got there. Think of events as markers that mark a specific point in the command stream sent to the GPU. Most likely, what you actually wanted to write is this:

cudaEventRecord(startGPU, stream);       // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>();
cudaEventRecord(stopGPU, stream);        // mark end of kernel execution
cudaEventSynchronize(stopGPU);   // wait for results to be available

and then use cudaEventElapsedTime() to get the time between the two events.
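For completeness, a self-contained sketch of the whole pattern; the empty Kernel and the <<<1, 1>>> launch configuration are placeholders for your actual kernel and grid/block sizes:

```cuda
#include <cstdio>

__global__ void Kernel() { /* dummy kernel standing in for the real one */ }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t startGPU, stopGPU;
    cudaEventCreate(&startGPU);
    cudaEventCreate(&stopGPU);

    cudaEventRecord(startGPU, stream);      // mark start of kernel execution
    Kernel<<<1, 1, 0, stream>>>();          // note the 0 for dynamic shared memory
    cudaEventRecord(stopGPU, stream);       // mark end in the same stream
    cudaEventSynchronize(stopGPU);          // block the CPU until stopGPU is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, startGPU, stopGPU);   // milliseconds between events
    printf("Elapsed GPU time = %f ms\n", ms);

    cudaEventDestroy(startGPU);
    cudaEventDestroy(stopGPU);
    cudaStreamDestroy(stream);
    return 0;
}
```

Since both events are recorded in the same stream, the measured interval covers only the work enqueued between them, independent of what the CPU does in the meantime.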

Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::chrono::steady_clock, or std::chrono::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock's period is actually fine enough for what you're trying to measure).
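A minimal C++ sketch of the steady_clock approach; the busy loop here is just a hypothetical stand-in for the kernel launch plus synchronization:

```cpp
#include <chrono>

// Measure elapsed wall-clock time around a piece of work
// using a monotonic clock.
long long time_work_us() {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    volatile long sum = 0;               // placeholder work; replace with the
    for (long i = 0; i < 1000000; ++i)   // kernel launch + cudaDeviceSynchronize()
        sum += i;

    auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

Unlike gettimeofday(), steady_clock is monotonic, so the measured interval can never be distorted by the system clock being adjusted mid-measurement.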

-1 votes

After debugging the same issue, I found that CUDA usually takes some time before the first kernel launch, as discussed in this forum thread: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.

In my case, the CUDA runtime API calls before the kernel included 6ms of cudaMalloc and 14ms of cudaLaunch, which was the reason for the extra delay. Subsequent kernels, however, ran normally. cudaLaunch usually takes microseconds, so if it takes significantly longer than that, something definitely needs fixing.

NOTE: If you are running CUDA kernels inside a while(1) loop, do the allocations outside the loop. Otherwise you will end up with delays just like this.
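A sketch of that pattern; the Kernel, the buffer d_buf, the size N, and the launch configuration are all hypothetical placeholders:

```cuda
float *d_buf;
cudaMalloc(&d_buf, N * sizeof(float));   // allocate once, outside the loop

// Optional warm-up launch, so one-time CUDA initialization cost
// is not attributed to the first timed iteration.
Kernel<<<abc, xyz>>>(d_buf);
cudaDeviceSynchronize();

while (1) {
    Kernel<<<abc, xyz>>>(d_buf);         // no allocation inside the loop
    cudaDeviceSynchronize();
    // ... consume results, decide whether to keep looping ...
}

cudaFree(d_buf);
```

Moving cudaMalloc out of the loop avoids paying the allocation cost on every iteration, and the warm-up launch keeps the one-time startup overhead out of the measurements.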