Several online resources, including this one, state that the number of instructions issued = the number of instructions executed + the number of replays. If that is true, and the number of replays cannot be negative, how can a CUDA kernel produce the following event counts (from nvprof)?
Invocations       Avg       Min       Max  Event Name
          1  69161760  69161760  69161760  inst_executed
          1  37263115  37263115  37263115  inst_issued1
          1  19130919  19130919  19130919  inst_issued2
(Taking inst_issued = inst_issued1 + inst_issued2 = 37263115 + 19130919 = 56394034 gives inst_executed/inst_issued ≈ 1.23 > 1, which would imply a negative number of replays.)
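To make the arithmetic explicit, here is a minimal Python sketch that plugs the counts above into the formula. It assumes that the plain sum of the two issue counters is the total number of issued instructions, which is exactly the assumption I am unsure about. The name implied_replays is my own and is not an nvprof event.

```python
# Event counts copied from the nvprof output above.
inst_executed = 69_161_760
inst_issued1 = 37_263_115
inst_issued2 = 19_130_919

# Assumption under test: total issued is the plain sum of the two counters.
inst_issued = inst_issued1 + inst_issued2

# If issued = executed + replays, then replays = issued - executed.
# implied_replays is a hypothetical name, not a profiler event.
implied_replays = inst_issued - inst_executed

ratio = inst_executed / inst_issued

print(f"inst_issued     = {inst_issued}")        # 56394034
print(f"implied replays = {implied_replays}")    # -12767726
print(f"executed/issued = {ratio:.4f}")          # 1.2264
```

The negative replay count is the contradiction: either the sum is not the right total, or the formula does not hold for these events.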
Is
inst_issued = inst_issued1 + inst_issued2
the correct formula for the total number of instructions issued? Are there issued instructions that are counted by events other than inst_issued1 and inst_issued2? If so, how can they be profiled?
I have not found an obvious answer online. For instance, my version of nvprof --query-events lists only the three events above as possible arguments to --events. There also seems to be no mention of this in the CUDA programming documentation, in the link above, or in any of the ten or so other pages I've read on CUDA instruction optimization.
Additional information:
0) I'm running CUDA 5.0, and compiling with nvcc -m64 -arch=sm_30.
1) I'm running a math-only version of my kernel, and since it has no register pressure, the number of global memory accesses is negligible.
2) I do not have access to the NVIDIA Visual Profiler, so I'm not sure whether it would give me answers different from those above.
Thanks a lot, and apologies in advance if this is silly.