1
votes

I have a C program with one major function that takes about 70% of the total runtime; I found this by profiling the application with gprof. I then rewrote that function in CUDA to reduce the runtime of the whole application. It currently gives correct results, but I want to know about the performance.

Is there any way (or tool) I can use to profile this new application so that it reports the runtime of the new kernel as a percentage of the whole application's runtime? I want to see the corresponding data for all the remaining C functions as well. I tried using nvprof, but it only outputs the runtimes of the CUDA kernels.

Thanks,

2
If you have a program whose function is to perform a specific calculation, that calculation should take most of the time in the program. Just because you optimize it doesn't mean it will take a smaller percentage of the run-time; it might actually take a larger percentage, since as it becomes faster it can do more work in the same time-frame. – Some programmer dude
@JoachimPileborg I don't follow your logic. If you reduce the time spent in one portion of your application, t1opt < t1 (e.g. by parallelisation), and the remainder t2 stays constant, then the percentage of time spent in the optimised part will be lower: t1opt/(t1opt+t2) < t1/(t1+t2). Amdahl's Law captures that fairly well. – Tom
@Tom Imagine you have an application which does a calculation X times per second, and that calculation accounts for 50% of the application's run-time. Now if the calculation is optimized to halve the time spent in it, you could halve its share of the application's total run-time to 25%, but what would you do with the other 25%? Just be idle? Why not instead let the calculations still account for 50% of the run-time, but do double the number of calculations in the same time-frame? – Some programmer dude
If you're comfortable with gprof, you can put a thin C wrapper function around your kernel call and profile with gprof. You may want to put a cudaDeviceSynchronize() call at the end of your wrapper, depending on the exact code in it (see the sketch after these comments). – Robert Crovella
@JoachimPileborg using your example, if you halve the time spent in the function you don't halve the percentage; it's not like your application is going to idle for the saved time, it will simply finish earlier! What you're describing is something different (akin in some respects to weak scaling), since you're trying to increase the work done in a fixed amount of time, whereas the original question is about doing a fixed amount of work in a reduced amount of time. – Tom
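A minimal sketch of the wrapper idea from the comments above, assuming a kernel with an illustrative signature (my_kernel, its arguments, and the launch configuration are placeholders for your own code):

    #include <cuda_runtime.h>

    // Your existing kernel; the signature here is purely illustrative.
    __global__ void my_kernel(const float *in, float *out, int n);

    // Thin host-side wrapper so gprof sees a single C function that
    // accounts for the whole kernel execution.
    void my_kernel_wrapper(const float *d_in, float *d_out, int n)
    {
        int block = 256;
        int grid  = (n + block - 1) / block;
        my_kernel<<<grid, block>>>(d_in, d_out, n);

        // Kernel launches are asynchronous: without this synchronization,
        // the wrapper would return immediately and the GPU time would be
        // attributed to whatever host code happens to block later.
        cudaDeviceSynchronize();
    }

Build the host code with gprof instrumentation (e.g. by passing -pg through to the host compiler with nvcc -Xcompiler -pg) so the wrapper shows up in gprof's output.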

2 Answers

3
votes

You can use the NVIDIA profiling tools to give you this information.

Running the command line tool nvprof <app> will give you the percentage breakdown, and you can use additional command line options to optimise your kernel further. The visual profiler (nvvp) will show you the timeline and the percentage of time spent in the kernels, and will also give you guidance on how to improve performance further (including links back to the documentation to explain concepts).
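For example, assuming your executable is called myapp (the summary mode shown is the default; no extra flags are needed):

    nvprof ./myapp

The summary table includes a Time(%) column giving each kernel's share of the measured GPU time.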

See the documentation for more info.

ADDENDUM

In your comment you say that you want to see the profile of the C functions as well. One way to do that would be to use NVTX to annotate your code; see this blog post for a way to automate that task. Alternatively, you could profile in nvprof or nvvp to see the overall timeline, and profile in gprof to see the time spent in non-GPU code. A sketch of manual NVTX annotation follows.
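A minimal sketch of manual NVTX annotation, assuming a host-side function you want to see on the timeline (the function name is hypothetical; link with -lnvToolsExt):

    #include <nvToolsExt.h>

    void remaining_c_function(void)               /* hypothetical host function */
    {
        nvtxRangePushA("remaining_c_function");   /* open a named range */

        /* ... existing CPU work ... */

        nvtxRangePop();                           /* close the range */
    }

Each pushed range appears as a labelled bar on the nvvp timeline, so you can read off how long the CPU sections take relative to the kernels.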

0
votes

Well, you might know that I'm partial to this technique.

It will tell you approximate percentages spent by functions, lines of code, anything you can identify. I assume your main program at some point has to wait for the CUDA kernels to finish processing, so the fraction of samples ending in that wait gives you an estimate of the time spent in CUDA. Samples not ending in that wait, but doing other things, indicate the time spent doing those other things.
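For the record, one manual way to take such samples on Linux, assuming a debug build of your application (myapp is a placeholder name):

    gdb ./myapp
    (gdb) run
    ^C              # interrupt at a random moment while it runs
    (gdb) bt        # record the full stack trace
    (gdb) continue  # repeat the interrupt/bt cycle 10-20 times

Each bt output is one stack sample; the fraction of samples that end in the CUDA wait estimates the kernel's share of the runtime.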

The statistics are pretty simple. If a line of code or function is on the stack for fraction F of the time, then it is responsible for that fraction of the time. So if you take N samples, the number of samples showing the line of code or function is, on average, NF. The standard deviation is sqrt(NF(1-F)). So if F is 50% or 0.5, and you take 20 random stack samples, you can expect to see the code on 10 of them, with a standard deviation of sqrt(20*0.5*0.5) ≈ 2.24 samples, or somewhere between 7 and 13 samples, most likely between 9 and 11. In other words, you get a very rough measurement of the code's cost, but you know precisely what code has that cost. If you're concerned about speed, you mainly care about the things that have a big cost, so the method is good at pointing those out to you.
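In symbols, the worked example above is just the binomial mean and standard deviation:

    \mu = NF = 20 \times 0.5 = 10,
    \qquad
    \sigma = \sqrt{NF(1-F)} = \sqrt{20 \times 0.5 \times 0.5} = \sqrt{5} \approx 2.24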

If you want to know why gprof doesn't tell you those things, there's a lot more to say about that.