Performance Analysis of Multiple Kernels (CUDA C)

Question

I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to analyze the performance of the program as a whole, specifically the GPU portion. I'm doing the analysis with the nvprof tool, using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on.

But the profiler reports metric values separately for each kernel, while I want a single value per metric across all kernels, to see the total GPU usage of the program. For each metric, should I take the average, the largest value, or the total over all kernels?

I would use a weighted average, where the weighting factor is the kernel execution time over the sum of all kernel execution times. — Robert Crovella
Thank you for your reply, I really need it. As I understand it, if I have 3 kernels and the achieved occupancy for each one separately, then to compute the overall achieved occupancy I have to: 1- first compute the weighting factor for each kernel; 2- then multiply that factor by the achieved occupancy of each kernel? Sorry about the confusion, but how do I then compute the overall achieved occupancy? — Sarah Hamed

Robert Crovella · Accepted Answer · 2018-11-07T16:00:40

One possible approach would be to use a weighted average method.

Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, the 3 kernels occupy 60 milliseconds of our overall application timeline.

Let's also suppose that the profiler reports the gld_efficiency metric as follows:

kernel     duration    gld_efficiency
     1        10ms               88%
     2        20ms               76%
     3        30ms               50%

You could compute the weighted average as follows:

                                     88*10        76*20        50*30
"overall"  global load efficiency =  -----   +    -----    +   ----- = 65%
                                       60           60           60
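The calculation above can be sketched as a small host-side helper in C. This is only an illustration under stated assumptions: the function name weighted_metric_average and its parameters are hypothetical (not part of nvprof or the CUDA toolkit), and the per-kernel values are assumed to have been copied by hand from the profiler output.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: weighted average of per-kernel metric values.
 *   values[i]  : metric reported for kernel i (e.g. gld_efficiency in percent)
 *   weights[i] : weight for kernel i (e.g. kernel duration in ms)
 * Returns NAN if the weights sum to zero (or n is 0). */
double weighted_metric_average(const double *values,
                               const double *weights,
                               size_t n)
{
    assert(values != NULL && weights != NULL);

    double weighted_sum = 0.0;  /* sum of values[i] * weights[i] */
    double weight_total = 0.0;  /* sum of weights[i]             */

    for (size_t i = 0; i < n; ++i) {
        weighted_sum += values[i] * weights[i];
        weight_total += weights[i];
    }

    return (weight_total > 0.0) ? weighted_sum / weight_total : NAN;
}
```

Calling it with the values {88, 76, 50} and the durations {10, 20, 30} from the table gives 3900 / 60 = 65, matching the hand calculation. Dividing by the summed weights once at the end is equivalent to dividing each term by 60 as written above.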

There may well be other approaches that also make sense. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and weight by that rather than by kernel duration:

kernel     gld_transactions    gld_efficiency
     1        1000               88%
     2        2000               76%
     3        3000               50%


                                     88*1000        76*2000        50*3000
"overall"  global load efficiency =  -------   +    -------    +   ------- = 65%
                                       6000           6000           6000
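Written generally, the transaction-weighted calculation is an instance of the usual weighted mean. The symbols here are assumptions introduced for illustration: m_i is the metric value reported for kernel i, w_i is the chosen weight for that kernel (its gld_transactions count in this example), and N is the number of kernels.

```latex
\bar{m} = \frac{\sum_{i=1}^{N} w_i \, m_i}{\sum_{i=1}^{N} w_i}
\qquad\text{e.g.}\qquad
\frac{88 \cdot 1000 + 76 \cdot 2000 + 50 \cdot 3000}{1000 + 2000 + 3000}
= \frac{390000}{6000} = 65\%
```

In this example the transaction counts are in the same 1:2:3 ratio as the kernel durations, which is why both weightings give 65%. With real profiler data the two would generally differ.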
