1 vote

According to the definition of flop_sp_efficiency:

Ratio of achieved to peak single-precision floating-point operations

The CUDA manual covers FLOPS here. The metric yields a ratio, e.g. 10%. That raises two questions about the term "peak":

1- Is that a hardware-specific value? If so, nvprof must know it in order to calculate the ratio, and the denominator should be constant for all applications run on a given device. According to the manual, that value is No_CUDA_cores * Graphic_clock_freq * 2 (see the sketch after these two questions). Is that how nvprof sets the denominator?

2- Or does it mean the peak value is achieved during the runtime of the program, per kernel? Assume a kernel is invoked 10 times and one invocation has the highest FLOPS (unrelated to the hardware value), e.g. 2 GFLOPS. Then the efficiency would be calculated as sum(FLOPS_i)/10, the average FLOPS across the 10 invocations, and that average divided by 2 would yield the FLOPS efficiency for the kernel. Under this assumption, one kernel may reach 2 GFLOPS while another reaches 4 GFLOPS. I say that because the metric is reported per kernel in nvprof.
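To make question 1 concrete, here is a minimal sketch of how I understand that hardware peak would be computed from device properties. The cores-per-SM values are my own assumption for a few architectures; the full table is _ConvertSMVer2Cores in the CUDA samples' helper_cuda.h:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Cores per SM by compute capability; values below are assumptions for
    // a few common architectures (full table: _ConvertSMVer2Cores in the
    // CUDA samples' helper_cuda.h).
    static int coresPerSM(int major, int minor) {
        if (major == 5) return 128;                     // Maxwell
        if (major == 6) return (minor == 0) ? 64 : 128; // Pascal
        if (major == 7) return 64;                      // Volta / Turing
        return 64;                                      // fallback guess
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double cores  = prop.multiProcessorCount * coresPerSM(prop.major, prop.minor);
        double clk_hz = prop.clockRate * 1000.0;  // clockRate is reported in kHz
        // peak SP FLOPS = No_CUDA_cores * Graphic_clock_freq * 2 (one FMA = 2 FLOPs)
        printf("Peak SP: %.1f GFLOPS\n", cores * clk_hz * 2.0 / 1e9);
        return 0;
    }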

Any comment on that?


2 Answers

5 votes

NVPROF (and other CUDA profilers) calculates FLOPS by replaying the kernel twice. In one pass the tool collects the time and the elapsed SM cycles. In the second pass the tool instruments the kernel to count the total number of floating-point operations executed.

SM_COUNT = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT
SM_MAX_FLOP_PER_CYCLE = CUDA_CORES * 2 / SM_COUNT

flop_sp_efficiency = flop_count_sp / (elapsed_cycles_sm * SM_MAX_FLOP_PER_CYCLE)

flops = flop_count_sp / gpu__time_duration * NANOSECONDS_PER_SECOND

gpuclk_hz = elapsed_cycles_sm / SM_COUNT / gpu__time_duration * NANOSECONDS_PER_SECOND

elapsed_cycles_sm is the number of cycles elapsed in the SM clock domain summed across all SMs. The SM clock domain is the same as the graphics clock.
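To make the arithmetic concrete, here is a small sketch that plugs hypothetical measured values into the formulas above. All numbers are invented for illustration, standing in for what the two replay passes would collect:

    #include <cstdio>

    int main() {
        // Hypothetical device: 20 SMs with 128 CUDA cores each.
        const double SM_COUNT   = 20.0;
        const double CUDA_CORES = SM_COUNT * 128.0;
        const double SM_MAX_FLOP_PER_CYCLE = CUDA_CORES * 2.0 / SM_COUNT;  // 256

        // Hypothetical measurements from the two replay passes.
        const double flop_count_sp      = 5.12e10;  // pass 2: instrumented FLOP count
        const double elapsed_cycles_sm  = 2.0e9;    // pass 1: cycles summed over all SMs
        const double gpu__time_duration = 8.0e7;    // wall-clock duration, ns

        double eff   = flop_count_sp / (elapsed_cycles_sm * SM_MAX_FLOP_PER_CYCLE);
        double flops = flop_count_sp / gpu__time_duration * 1e9;
        double clk   = elapsed_cycles_sm / SM_COUNT / gpu__time_duration * 1e9;

        printf("flop_sp_efficiency = %.0f%%\n", eff * 100.0);                    // 10%
        printf("achieved = %.0f GFLOPS at %.2f GHz\n", flops / 1e9, clk / 1e9);  // 640, 1.25
        return 0;
    }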

NVPROF has neither an event nor a metric for time duration. Time duration can, however, be captured in NVPROF using the trace activity. In Perfworks the metric gpu__time_duration is the wall-clock duration of the kernel.
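In practice that means separate runs: for example, nvprof --print-gpu-trace ./app reports per-launch wall-clock durations from the trace, while nvprof --metrics flop_count_sp,flop_sp_efficiency ./app collects the counter-based metrics.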

Nsight VSE CUDA Profiler allows the developer to customize the per-instruction weights or to define completely new experiments using SASS regular expressions. See https://docs.nvidia.com/nsight-visual-studio-edition/Nsight_Visual_Studio_Edition_User_Guide.htm#Analysis/Report/CudaExperiments/KernelLevel/AchievedFlops.htm

ANSWER 1 - Yes, the tools use a real-time measurement to determine the theoretical maximum: the denominator is built from the SM cycles that actually elapse during a replay of the kernel, rather than from a fixed rated clock frequency.

ANSWER 2 - The metric is collected for each execution of a kernel. NVPROF (but not other tools) rolls up the metric across launches of kernels with the same function name using an unweighted average.
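For example, if one launch of a kernel achieves 10% efficiency and another achieves 30%, nvprof reports 20% for that kernel name, regardless of how long each launch ran.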

2 votes
  1. Yes, the definition of "FLOPS" is the number of floating-point operations per second, where multiply-add operations can be counted as one or two "ops". nvprof simply uses the peak FLOP rate of the device in the efficiency calculation, which it determines from the properties of the device (i.e. the number of ALUs) and the frequency the device reports (I do not believe it measures the actual frequency in real time).

  2. FLOPS is already a time average (and it clearly varies over the execution of a single kernel), so when it is reported on a per-kernel basis it is also an average over all invocations of the kernel in question.
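As a quick sanity check of the counting convention (a multiply-add counted as two ops), a toy kernel like the following, my own example rather than anything from the answers, performs one single-precision FMA per element, so flop_count_sp should come out as 2 * n when run under nvprof --metrics flop_count_sp:

    #include <cstdio>
    #include <cuda_runtime.h>

    // One single-precision multiply-add per element; with FMA counted as
    // two ops, nvprof's flop_count_sp should report 2 * n for this kernel.
    __global__ void fma_kernel(const float* a, const float* b,
                               const float* c, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a[i] * b[i] + c[i];  // typically compiles to one FFMA
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c, *out;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; c[i] = 3.f; }

        fma_kernel<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
        cudaDeviceSynchronize();
        printf("expected flop_count_sp = %d\n", 2 * n);
        return 0;
    }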