I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get, and I'm unsure how to act on the different profiler results.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel performs about 1000 3D texture memory reads for each of 512^2 elements, with one thread allocated per element. Some arithmetic is needed to work out which position to read from texture memory. The texture reads are performed without interpolation, always at the exact data index. 3D texture memory was chosen because the reads are relatively random, so memory coalescing is not to be expected. I can't find the reference for this, but I definitely read it on SO somewhere.
The description is short, but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would probably be too much, but I can if required).
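To make the structure concrete, here is a minimal sketch of what such a kernel might look like. The names, launch configuration, and index arithmetic are invented for illustration; this is not my actual code:

```
// Hypothetical skeleton, not the real kernel: one thread per 512^2 element,
// each performing ~1000 effectively random 3D texture fetches.
texture<float, cudaTextureType3D, cudaReadModeElementType> volTex; // point filtering, unnormalized coords

__global__ void sampleKernel(float *out, int nReads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // one thread per element
    if (idx >= 512 * 512) return;

    float acc = 0.0f;
    for (int i = 0; i < nReads; ++i) {
        // Stand-in arithmetic for the real position computation;
        // the resulting coordinates are effectively random.
        int x = (idx * 131 + i * 61) & 511;
        int y = (idx * 193 + i * 97) & 511;
        int z = (idx + i) & 511;
        acc += tex3D(volTex, x + 0.5f, y + 0.5f, z + 0.5f); // exact texel, no interpolation
    }
    out[idx] = acc;
}
```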
From here on, I will describe my interpretation of the profiler output.
When profiling, if I run Examine GPU Usage I get:
From here I see several things:
- Low Memcopy/Compute Overlap (0%). This is expected, as I run a big kernel, wait until it has finished, and then memcopy (see the host-side sketch after this list). There should not be any overlap.
- Low Kernel Concurrency (0%). I only have 1 kernel, so this is expected.
- Low Memcopy Overlap (0%). Same thing. I only memcopy once at the beginning, and once after each kernel. This is expected.
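For reference, this is the host-side pattern I described, written out in hypothetical form (variable names invented). With everything on the default stream, 0% overlap is there by construction:

```
// Hypothetical host loop matching the description above.
for (int it = 0; it < 360; ++it) {
    sampleKernel<<<grid, block>>>(d_out, 1000); // big kernel
    cudaDeviceSynchronize();                    // wait until it has finished
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // then memcopy
}
```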
From the kernel execution "bars" at the top and right I can see:
- Most of the time is spent running kernels. There is little memory-transfer overhead.
- All kernels take the same time (good).
- The biggest red flag is occupancy, always below 45%, with registers being the limiter. However, optimizing occupancy doesn't always seem to be a priority.
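As an aside on the register limiter: register use can be capped per kernel with __launch_bounds__, or per file with nvcc's -maxrregcount. The numbers below are illustrative, not tuned for my kernel:

```
// Hypothetical cap: at most 256 threads/block, at least 4 resident blocks/SM.
// The compiler limits registers per thread to make this fit, possibly
// spilling to local memory, which can hurt more than low occupancy does.
__global__ void __launch_bounds__(256, 4) sampleKernel(float *out, int nReads)
{
    // ... same body as before ...
}
// File-wide alternative: nvcc -maxrregcount=32 mykernel.cu
```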
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that:
- Compute and memory utilization are low in the kernel. The profiler suggests that anything below 60% is a problem.
- Most of the time is spent on compute and on L2 cache reads.
Something else?
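As a side note, the same utilization figures can also be collected from the command line with nvprof, e.g. (metric names as in CUDA 7.5; ./myapp is a placeholder):

```
nvprof --metrics achieved_occupancy,l2_utilization,tex_utilization,dram_utilization ./myapp
```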
I continue by running Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The 3 biggest stall reasons seem to be:
- Memory dependency. Too many texture memreads? But I need this many memreads.
- Execution dependency. "Can be reduced by increasing instruction level parallelism." Does this mean that I should try to change e.g. `a=a+1; a=a*a; b=b+1; b=b*b;` to `a=a+1; b=b+1; a=a*a; b=b*b;`? (See the sketch after this list.)
- Instruction fetch (????)
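To make the ILP idea concrete for a loop like the one in my kernel, here is a hypothetical rewrite with two independent accumulators (xi/yi/zi stand in for the real index arithmetic; whether the compiler already performs this transformation is an open question):

```
// Variants A and B are alternatives for the kernel body, not meant to coexist.

// Variant A, single accumulator: every += must wait for the previous
// texture read to complete, forming one long dependency chain.
float acc = 0.0f;
for (int i = 0; i < nReads; ++i)
    acc += tex3D(volTex, xi(i), yi(i), zi(i));

// Variant B, two independent chains: the next read can be issued while
// the previous one is still in flight (assumes nReads is even).
float acc0 = 0.0f, acc1 = 0.0f;
for (int i = 0; i < nReads; i += 2) {
    acc0 += tex3D(volTex, xi(i),     yi(i),     zi(i));
    acc1 += tex3D(volTex, xi(i + 1), yi(i + 1), zi(i + 1));
}
float acc = acc0 + acc1;
```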
Questions:
- Are there additional tests I can perform to better understand what limits my kernel's execution time?
- Is there a way to profile at the instruction level inside the kernel?
- Are there more conclusions one can draw from this profiling than the ones I have drawn?
- If I were to start trying to optimize the kernel, where would I start?