Well, you might know that I'm partial to this technique.
It will tell you the approximate percentage of time spent in functions, lines of code, anything you can identify.
I assume your main program at some point has to wait for the CUDA kernels to finish processing, so the fraction of samples ending in that wait estimates the fraction of time spent in CUDA.
Samples not ending in that wait, but in other activities, indicate the fraction of time spent in those other activities.
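To make that concrete, here is a minimal sketch of the kind of code this assumes — the kernel myKernel, the wrapper runKernel, and the sizes are hypothetical stand-ins for whatever your program actually does. The only point is that the wrapper ends in a blocking cudaDeviceSynchronize(), so a stack sample taken while the GPU is busy ends inside that call, underneath runKernel and main:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical kernel, standing in for whatever your wrappers launch.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * 2.0f + 1.0f;
    }

    // Wrapper: launch the kernel, then block until the GPU is done.
    // A stack sample taken while the GPU is working shows the program
    // waiting here, called from main(), so the fraction of samples with
    // this stack estimates the fraction of time spent in CUDA.
    void runKernel(float *d_data, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        myKernel<<<blocks, threads>>>(d_data, n);
        cudaDeviceSynchronize();   // the wait that sampling attributes to CUDA
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        for (int iter = 0; iter < 1000; ++iter)
            runKernel(d_data, n);   // samples not stuck in the sync are host-side work

        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_data);
        return 0;
    }

If your wrappers launch kernels asynchronously and return right away, there is no such wait for the samples to land in, which is what the comment at the bottom about adding cudaDeviceSynchronize() to the wrapper is getting at.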
The statistics are pretty simple. If a line of code or function is on the stack for a fraction F of the time, then it is responsible for that fraction of the time. So if you take N samples, the number of samples showing that line of code or function is, on average, NF, with standard deviation sqrt(NF(1-F)).
So if F is 50% or 0.5, and you take 20 random stack samples, you can expect to see the code on 10 of them, with a standard deviation of sqrt(20*0.5*0.5) ≈ 2.24 samples, so the count will usually land between 7 and 13, and about half the time between 9 and 11.
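In case the connecting step helps: each sample independently either lands on the code (with probability F) or not, so the count k of samples showing it is binomial, which is where the mean and standard deviation above come from:

    k \sim \mathrm{Binomial}(N, F), \qquad
    \mathbb{E}[k] = NF, \qquad
    \sigma_k = \sqrt{N F (1 - F)}

Plugging in N = 20 and F = 0.5 gives the 10 ± 2.24 of the example.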
In other words, you get a very rough measurement of the code's cost, but you know precisely what code has that cost.
If you're concerned about speed, you mainly care about the things that have a big cost, so it's good at pointing those out to you.
If you want to know why gprof doesn't tell you those things, there's a lot more to say about that.
You may need a cudaDeviceSynchronize() call at the end of your wrapper, depending on your exact code in the wrapper. – Robert Crovella