18
votes

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.

There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.

How do you identify ways to make your CUDA kernels perform faster?

4

4 Answers

22
votes

If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.

Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:

  • Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
  • Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
  • Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
  • Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)

This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.

0
votes

The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.

Maybe you could post your kernel code here and get some feedback ?

The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.

0
votes
0
votes

I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.

To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:

  • Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.

  • Running it on a CUDA processor, and doing the same thing, if possible.

What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.