
I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.

However, I don't seem to fully understand the implications of the outputs I get, and I am unsure how to act on the different profiler results.

As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel computes 512^2 results, each requiring about 1000 reads from 3D texture memory. One thread is allocated per unit of the 512^2. Some arithmetic is needed to work out which position to read in texture memory. The texture reads are performed without interpolation, always at an exact data index. 3D texture memory was chosen because the reads are relatively random, so memory coalescing is not expected. I can't find the reference for this, but I definitely read it somewhere on SO.
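For concreteness, the kernel has roughly this shape (a hypothetical sketch, not my actual code; the kernel name and the indexing arithmetic are placeholders):

```cuda
// Hypothetical sketch of the kernel structure described above.
// One thread per element of the 512x512 output; each thread performs
// ~1000 point (uninterpolated) fetches from a 3D texture.
__global__ void gatherKernel(cudaTextureObject_t tex, float *out, int numReads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= 512 * 512) return;

    float acc = 0.0f;
    for (int i = 0; i < numReads; ++i) {   // numReads ~ 1000
        // Placeholder arithmetic: the real code computes a data-dependent,
        // relatively random (x, y, z) index here.
        int x = (idx + 37 * i) & 511;
        int y = (idx >> 9) & 511;
        int z = i & 63;
        // +0.5f targets the exact texel center, so no interpolation occurs.
        acc += tex3D<float>(tex, x + 0.5f, y + 0.5f, z + 0.5f);
    }
    out[idx] = acc;
}
```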

The description is short, but I hope it gives a small overview of the operations the kernel performs (posting the whole kernel would probably be too much, but I can if required).

From now on, I will describe my interpretation of the profiler.


When profiling, if I run Examine GPU Usage I get:

(screenshot: GPU usage examination results)

From here I see several things:

  • Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
  • Low Kernel Concurrency 0%. I only have one kernel, so this is expected.
  • Low Memcopy Overlap 0%. Same thing. I only memcopy once at the beginning, and I memcopy once after each kernel. This is expected.
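For reference, if overlap ever became worthwhile, the usual pattern is to split the work into chunks across streams so the copy for one chunk can run concurrently with the kernel for another (a sketch with hypothetical kernel and buffer names; it assumes pinned host memory allocated with cudaMallocHost):

```cuda
// Sketch: double-buffered streams so the copy for chunk c+1 can overlap
// the kernel for chunk c. Host buffers must be pinned for
// cudaMemcpyAsync to be truly asynchronous.
cudaStream_t streams[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&streams[i]);

for (int c = 0; c < numChunks; ++c) {
    cudaStream_t s = streams[c % 2];
    cudaMemcpyAsync(dIn[c], hIn[c], chunkBytes, cudaMemcpyHostToDevice, s);
    myKernel<<<grid, block, 0, s>>>(dIn[c], dOut[c]);
    cudaMemcpyAsync(hOut[c], dOut[c], chunkBytes, cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();
```

In my case the kernel consumes the whole dataset at once, so this pattern does not apply, and the 0% overlap figures are nothing to worry about.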

From the kernel execution "bars", and the panels at the top and right, I can see:

I follow my profiling by running Perform Kernel Analysis, getting:

(screenshot: kernel analysis results)

I can see here that

  • Compute and memory utilization is low in the kernel. The profiler suggests that anything below 60% is problematic.
  • Most of the time is spent computing and reading from the L2 cache.

Something else?

I continue with Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.

(screenshot: latency analysis results)

The three biggest stall reasons seem to be:

  • Memory dependency. Too many texture reads? But I need that many reads.
  • Execution dependency. "Can be reduced by increasing instruction level parallelism". Does this mean I should try to change e.g. a=a+1; a=a*a; b=b+1; b=b*b; to a=a+1; b=b+1; a=a*a; b=b*b;?
  • Instruction fetch (????)
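To make the execution-dependency question concrete, this is the transformation I mean (a hypothetical snippet, not my actual code):

```cuda
// Dependent chain: each statement must wait for the previous result.
a = a + 1.0f;
a = a * a;
b = b + 1.0f;
b = b * b;

// Interleaved: the a-chain and b-chain are independent, so while one
// chain's result is still in flight the scheduler can issue the other.
a = a + 1.0f;
b = b + 1.0f;
a = a * a;
b = b * b;
```

(I am aware the compiler may already reorder code like this on its own; unrolling the inner loop, e.g. with #pragma unroll, might be a more reliable way to expose independent instructions.)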

Questions:

  • Are there additional tests I can perform to better understand my kernel's execution-time limitations?
  • Is there a way to profile at the instruction level inside the kernel?
  • Are there more conclusions one can draw from the profiling than the ones I have obtained?
  • If I were to start trying to optimize the kernel, where would I start?

1 Answer


Are there additional tests I can perform to better understand my kernel's execution-time limitations?

Of course! Pay attention to the "Properties" window. Your screenshot tells you that your kernel is 1. limited by register usage (check it in the 'Kernel Latency' analysis), and 2. low in Warp Efficiency (less than 100% means thread divergence; check it in 'Divergent Execution').
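If register usage is the limiter, one common knob is __launch_bounds__, which asks the compiler to cap register usage so more blocks fit per SM (a hypothetical sketch; the right numbers depend on the occupancy figures nvvp reports for your kernel):

```cuda
// Hypothetical example: promise at most 256 threads per block and ask
// for at least 4 resident blocks per SM; the compiler will limit
// register usage (spilling to local memory if it must) to meet this.
__global__ void __launch_bounds__(256, 4)
myKernel(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] * 2.0f;   // placeholder body
}
```

The nvcc flag -maxrregcount=N does something similar globally for a whole compilation unit. Note that capping registers too aggressively can cause spills that hurt more than the extra occupancy helps, so check the profiler again after each change.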

Is there a way to profile at the instruction level inside the kernel?

Yes, you have available two types of profiling:

  1. 'Kernel Profile - Instruction Execution'
  2. 'Kernel Profile - PC Sampling' (Only in Maxwell)

Are there more conclusions one can draw from the profiling than the ones I have obtained?

You should check whether your kernel has any thread divergence. You should also check that there is no problem with shared/global memory access patterns.
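For concreteness, divergence means threads of the same 32-wide warp taking different branches, which forces the warp to execute both paths serially (a hypothetical example; slowPathA/slowPathB are invented names):

```cuda
// Divergent: even and odd lanes of every warp take different paths,
// so each warp executes both branches one after the other.
if (threadIdx.x % 2 == 0)
    out[idx] = slowPathA(in[idx]);
else
    out[idx] = slowPathB(in[idx]);

// Warp-uniform: the condition is constant across each 32-thread warp,
// so every warp takes exactly one path and nothing is serialized.
if ((threadIdx.x / 32) % 2 == 0)
    out[idx] = slowPathA(in[idx]);
else
    out[idx] = slowPathB(in[idx]);
```

The 'Divergent Execution' analysis points you at the source lines where this happens.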

If I were to start trying to optimize the kernel, where would I start?

I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.