1
votes

background:

I have a kernel that I measure with Windows QPC (264-nanosecond tick rate) at 4 ms. But I am in a friendly dispute with a colleague running my kernel who claims it takes 15 ms+ (we are both timing after warm-up, on a Tesla K40). I suspect his issue is with a custom RHEL, custom CUDA drivers, and his "real time" thread groups, but I am not a Linux expert. I know Windows clocks are less than perfect, but this is too big a discrepancy. (Besides, all our timings of the other kernels I wrote agree with his; it is only the first kernel in the chain where the times disagree.) Smells to me of something outside the kernel.

question:

Anyway, is there a way with CUDA device events (elapsed time) to add something to the CUDA kernel to measure the ENTIRE kernel time, from when the first block starts to the end of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like CUDA device events are recorded on the host, and I am looking for something internal to the GPU.
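For reference, the host-side event timing I have read about looks roughly like the sketch below (myKernel, the launch configuration, and the buffer are placeholders, not my actual code); both events are recorded from the host into the stream, which is why I am looking for something measured inside the GPU instead:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real kernel under dispute.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);                  // queued from the host into stream 0
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                 // wait until the stop event has occurred

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);     // elapsed time between the two events
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }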

The CUDA profiler should be able to give you an accurate assessment of pure kernel execution time. For this case, you can just use the minimal profiler built into the driver: export the environment variable CUDA_PROFILE=1, run your app, then inspect the generated log file. Make sure to unset the profiler environment variable once you are done with the measurements. – njuffa
Before I dive deep into the profiler, can you tell me if it internally does (a) emulation of the PTX, (b) statistical sampling for timing, or (c) code insertion that writes to logs? These are the three main ways profilers work, but each has a different effect on the timing, and in our case we are not dealing with a classic compute-bound problem but a data-bound issue, so it will make a difference which way things are done. – Dr.YSG
I have no knowledge of the internal workings of the profiler, but it is not (a). PTX is an intermediate code representation; it is compiled to machine code (SASS), which is what executes on the GPU. The profiler can tell you about the properties of memory accesses in the code based on HW performance counters. There are some simple strategies for memory-bound kernels; the Best Practices Guide should be a good starting point. – njuffa

1 Answer

2
votes

The only way to time execution from entirely within a kernel is to use the clock() and clock64() functions that are covered in the programming guide.

Since these functions sample a per-multiprocessor counter, and AFAIK there is no specified relationship between these counters from one SM to the next, there is no way, using these functions alone, to determine which thread/warp/block is "first" to execute and which is "last" to execute, assuming your GPU has more than 1 SM. (Even if there were a specified relationship, such as "they are all guaranteed to be the same value on any given cycle", you would still need additional scaffolding, as mentioned below.)
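For what it's worth, per-block timing with clock64() might look something like the sketch below (the kernel body and the blockCycles buffer are placeholders). A block begins and ends on the same SM, so each block's own duration in cycles is meaningful, even though readings from different blocks cannot be compared to establish a grid-wide start and end:

    // Sketch: each block records how many cycles elapsed on its SM between
    // its first and last instruction here.
    __global__ void timedKernel(float *data, long long *blockCycles, int n)
    {
        long long t0;
        if (threadIdx.x == 0)
            t0 = clock64();                           // per-SM counter at block start

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                   // placeholder for the real work

        __syncthreads();                              // all threads in the block are done
        if (threadIdx.x == 0)
            blockCycles[blockIdx.x] = clock64() - t0;
    }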

While you could certainly create some additional scaffolding in your code to try to come up with an overall execution time (perhaps adding atomics to figure out which thread/warp/block is first and last), there may still be functional gaps in the method. Given the difficulty, the best approach, based on what you've described, is simply to use the profilers, as discussed by @njuffa in the comments. Any of the profilers can report the execution time of a kernel, on any supported platform, with a trivial set of commands.
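If you do want to experiment with that kind of scaffolding despite the caveats, one rough sketch (the names are hypothetical, and the result is only approximate precisely because the "first" and "last" samples may come from counters on different SMs) could be:

    // Hypothetical scaffolding; reset both symbols before each launch
    // (e.g. with cudaMemcpyToSymbol). 64-bit atomicMin/atomicMax require
    // compute capability 3.5+, which the K40 has.
    __device__ unsigned long long firstStart = 0xFFFFFFFFFFFFFFFFULL;
    __device__ unsigned long long lastEnd    = 0ULL;

    __global__ void scaffoldedKernel(float *data, int n)
    {
        if (threadIdx.x == 0)
            atomicMin(&firstStart, (unsigned long long)clock64()); // earliest block start seen

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                                 // placeholder work

        __syncthreads();
        if (threadIdx.x == 0)
            atomicMax(&lastEnd, (unsigned long long)clock64());     // latest block end seen
    }

On the host you would copy firstStart and lastEnd back (for example with cudaMemcpyFromSymbol) and take the difference as a cycle count, keeping in mind that the two samples may come from different SMs and are therefore not strictly comparable.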