According to the definition of flop_sp_efficiency:
"Ratio of achieved to peak single-precision floating-point operations"
The CUDA manual covers FLOPS, here. The metric yields a ratio, e.g. 10%. That raises two questions about the term "peak":
1- Is that a hardware-specific value? If so, nvprof must know it in order to calculate the ratio, and the denominator should be constant for all applications run on a specific device. According to the manual, that value is No_CUDA_cores * Graphic_clock_freq * 2. Is that how nvprof sets the denominator?
2- Or is the peak the highest value achieved at runtime, per kernel? Assume a kernel is invoked 10 times, and one invocation has the highest FLOPS (unrelated to the hardware value), e.g. 2 GFLOPS. Then sum(FLOPS_i)/10 gives the average FLOPS over the 10 invocations, and this average divided by 2 yields the FLOPS efficiency for that kernel. Under this assumption, the peak of one kernel may be 2 GFLOPS while the peak of another kernel may be 4 GFLOPS. I ask because nvprof reports the metric per kernel.
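To make the two readings concrete, here is a minimal Python sketch of both. It only illustrates the arithmetic I have in mind and says nothing about how nvprof actually computes the metric. The function names, the device figures (1024 cores at 1 GHz) and the per-invocation FLOPS values are all made up for illustration.

```python
def hardware_peak_flops(cuda_cores, clock_hz):
    # Reading 1: a fixed per-device peak, assuming each core can retire
    # one fused multiply-add (counted as 2 FLOPs) per clock cycle.
    return cuda_cores * clock_hz * 2


def efficiency_vs_hardware_peak(flops_per_invocation, peak_flops):
    # Average achieved FLOPS over all invocations, divided by the device peak.
    mean_flops = sum(flops_per_invocation) / len(flops_per_invocation)
    return mean_flops / peak_flops


def efficiency_vs_observed_peak(flops_per_invocation):
    # Reading 2: the "peak" is the best invocation of this same kernel.
    mean_flops = sum(flops_per_invocation) / len(flops_per_invocation)
    return mean_flops / max(flops_per_invocation)


# Hypothetical device and hypothetical measurements (FLOPS per invocation).
peak = hardware_peak_flops(cuda_cores=1024, clock_hz=1.0e9)
kernel_a = [1.0e9, 2.0e9, 1.5e9, 1.5e9]  # best invocation: 2 GFLOPS
kernel_b = [2.0e9, 4.0e9, 3.0e9, 3.0e9]  # best invocation: 4 GFLOPS

for name, runs in (("kernel_a", kernel_a), ("kernel_b", kernel_b)):
    print(name,
          efficiency_vs_hardware_peak(runs, peak),
          efficiency_vs_observed_peak(runs))
```

With these made-up numbers, both kernels get the same efficiency under the second reading even though kernel_b sustains twice the FLOPS, while the first reading tells them apart because the denominator is the same for every kernel on the device.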
Any comment on that?