The PTX version of my kernel looks like this:
.version 2.2
.target sm_20, texmode_independent
.entry histogram(
.param .u32 .ptr .global .align 4 histogram_param_0,
.param .u32 .ptr .global .align 4 histogram_param_1
)
{
.reg .f32 %f<2>;
.reg .s32 %r<12>;
_histogram:
mov.u32 %r1, %tid.x;
mov.u32 %r2, %envreg3;
add.s32 %r3, %r1, %r2;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %ntid.x;
mad.lo.s32 %r6, %r4, %r5, %r3;
shl.b32 %r7, %r6, 2;
ld.param.u32 %r8, [histogram_param_0];
add.s32 %r9, %r8, %r7;
ld.param.u32 %r10, [histogram_param_1];
ld.global.f32 %f1, [%r9];
add.s32 %r11, %r10, %r7;
st.global.f32 [%r11], %f1;
ret;
}
By my count, there are only 13 instructions in my kernel (not including the ret instruction). I launch 5120 work items with a workgroup size of 64. The device has 16 SMs, each with 32 scalar processors, so the code above should be executed 5120 / (16 * 32) = 10 times on each SM. I therefore expected the number of executed instructions to be 10 * 13 = 130. But the profiler reports: issued instructions = 130, executed instructions = 100.

1. Why does the number of issued instructions differ from the number of executed instructions? There are no branches, so shouldn't they be equal?
2. Why is the number of executed instructions smaller than expected? Shouldn't every instruction in the PTX version be executed at least once?
3. Do cache misses (L1 and L2) have any impact on the number of issued instructions and the number of executed instructions?

Thanks
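
For reference, the arithmetic behind my expected count can be sketched in a few lines of Python. This is only a rough sketch: the variable names are mine, and it assumes a warp size of 32, workgroups spread evenly across the SMs, counters that are reported per warp for a single SM, and one hardware instruction per PTX instruction. Any of these assumptions may not hold on the real device.

```python
# Launch configuration taken from the question above.
work_items = 5120
workgroup_size = 64
num_sms = 16             # SMs on the device
warp_size = 32           # assumed warp width for an sm_20 target
instrs_per_thread = 13   # PTX instructions in the kernel, excluding ret

workgroups = work_items // workgroup_size            # 80 workgroups in total
warps_per_workgroup = workgroup_size // warp_size    # 2 warps per workgroup

# Assumption: the scheduler distributes workgroups evenly over the SMs.
workgroups_per_sm = workgroups // num_sms            # 5 workgroups per SM
warps_per_sm = workgroups_per_sm * warps_per_workgroup

# Assumption: each PTX instruction is counted once per warp.
expected_per_sm = warps_per_sm * instrs_per_thread

print("warps per SM:", warps_per_sm)
print("expected instructions per SM:", expected_per_sm)
```

With the numbers above this gives 10 warps per SM and 130 instructions, which matches the issued count but not the executed count.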