2
votes

The PTX version of my kernel looks like this:

.version 2.2
.target sm_20, texmode_independent

.entry histogram(
        .param .u32 .ptr .global .align 4 histogram_param_0,
        .param .u32 .ptr .global .align 4 histogram_param_1
)
{
        .reg .f32       %f<2>;
        .reg .s32       %r<12>;

_histogram:
        mov.u32         %r1, %tid.x;
        mov.u32         %r2, %envreg3;
        add.s32         %r3, %r1, %r2;
        mov.u32         %r4, %ctaid.x;
        mov.u32         %r5, %ntid.x;
        mad.lo.s32      %r6, %r4, %r5, %r3;
        shl.b32         %r7, %r6, 2;
        ld.param.u32    %r8, [histogram_param_0];
        add.s32         %r9, %r8, %r7;
        ld.param.u32    %r10, [histogram_param_1];
        ld.global.f32   %f1, [%r9];
        add.s32         %r11, %r10, %r7;
        st.global.f32   [%r11], %f1;
        ret;
}
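The instruction count for this listing can be checked mechanically. Below is a minimal Python sketch, not part of any toolchain: the helper name `count_instructions` is hypothetical, and it assumes one statement per line, with labels ending in `:` and directives starting with `.`. It counts PTX statements only and says nothing about the machine code the GPU finally runs.

```python
# The body of the kernel above, copied verbatim (register declarations omitted).
PTX_BODY = """\
_histogram:
        mov.u32         %r1, %tid.x;
        mov.u32         %r2, %envreg3;
        add.s32         %r3, %r1, %r2;
        mov.u32         %r4, %ctaid.x;
        mov.u32         %r5, %ntid.x;
        mad.lo.s32      %r6, %r4, %r5, %r3;
        shl.b32         %r7, %r6, 2;
        ld.param.u32    %r8, [histogram_param_0];
        add.s32         %r9, %r8, %r7;
        ld.param.u32    %r10, [histogram_param_1];
        ld.global.f32   %f1, [%r9];
        add.s32         %r11, %r10, %r7;
        st.global.f32   [%r11], %f1;
        ret;
"""


def count_instructions(ptx, skip=("ret",)):
    """Count PTX statements, ignoring blanks, labels, directives and `skip` opcodes."""
    count = 0
    for line in ptx.splitlines():
        stmt = line.strip()
        if not stmt or stmt.endswith(":") or stmt.startswith("."):
            continue  # blank line, label, or directive such as .reg
        opcode = stmt.split()[0].rstrip(";")
        if opcode in skip:
            continue  # the question excludes ret from its count
        count += 1
    return count


print(count_instructions(PTX_BODY))
```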

As I counted, there are only 13 instructions in my kernel (not including the ret instruction). I set the number of work items to 5120 and the workgroup size to 64. Since there are 16 SMs, each with 32 scalar processors, the above code will be executed 10 times on each SM, so I expected the number of executed instructions to be 10*13 = 130. But after profiling, the results are: issued instructions = 130, executed instructions = 100.

1. Why is the number of issued instructions different from the number of executed instructions? There are no branches, so aren't they supposed to be equal?
2. Why is the number of executed instructions smaller than expected? Shouldn't at least all the instructions in the PTX version be executed?
3. Do cache misses (L1 and L2) have any impact on the number of issued instructions and the number of executed instructions?

Thanks
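The expectation above can be written out as a short Python sketch. The warp size of 32 and the even distribution of workgroups across SMs are assumptions of this sketch, and the figure of 13 is the PTX count from the question, not a count of the machine instructions actually executed.

```python
WORK_ITEMS = 5120        # global work size from the question
WORKGROUP_SIZE = 64      # local work size from the question
NUM_SMS = 16             # SM count from the question
WARP_SIZE = 32           # assumption: one warp covers 32 work items
PTX_INSTRUCTIONS = 13    # PTX statements counted in the question, excluding ret

workgroups = WORK_ITEMS // WORKGROUP_SIZE             # 80 workgroups
warps_per_workgroup = WORKGROUP_SIZE // WARP_SIZE     # 2 warps per workgroup
total_warps = workgroups * warps_per_workgroup        # 160 warps overall
warps_per_sm = total_warps // NUM_SMS                 # assumes an even split across SMs

# Estimate of per-SM warp-level instructions if PTX mapped 1:1 to machine code.
expected_per_sm = warps_per_sm * PTX_INSTRUCTIONS

print(f"warps per SM: {warps_per_sm}")
print(f"expected instructions per SM: {expected_per_sm}")
```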

2
Can you please format this code? - Jesus Ramos

2 Answers

2
votes

PTX is only an intermediate representation of compiled code. It is not what the GPU actually executes. There is a further assembly step that emits the code the GPU runs; this can happen either at compile time or through JIT compilation in the driver. As a result, your instruction counts, and anything you infer from them, are invalid.

NVIDIA ships a tool called cuobjdump, which can disassemble the assembler output generated for Fermi cards and show the actual machine code run on the GPU.

2
votes

Keep in mind that PTX is not exactly what is executed on the GPU. PTX is merely an intermediate representation. The real code is in .cubin files. That's why making such precise calculations based on PTX source code is pointless.

You can use cuobjdump --sass, provided with CUDA 4.0, to extract the GPU assembly code from .cubin files into something more readable.