cuda - minimal example, high register usage

Question

Consider these 3 trivial, minimal kernels. Their register usage is much higher than I expect. Why?

A:

__global__ void Kernel_A()
{  
//empty
}

corresponding ptx:

ptxas info    : Compiling entry function '_Z8Kernel_Av' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_Av
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 32 bytes cmem[0]

B:

template<uchar effective_bank_width>
__global__ void  Kernel_B()
{
//empty
}

template
__global__ void  Kernel_B<1>();

corresponding ptx:

ptxas info    : Compiling entry function '_Z8Kernel_BILh1EEvv' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_BILh1EEvv
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 32 bytes cmem[0]

C:

template<uchar my_val>
__global__ void  Kernel_C
        (uchar *const   device_prt_in, 
        uchar *const    device_prt_out)
{ 
//empty
}

corresponding ptx:

ptxas info    : Compiling entry function '_Z35 Kernel_CILh1EEvPhS0_' for 'sm_20'
ptxas info    : Function properties for _Z35 Kernel_CILh1EEvPhS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 48 bytes cmem[0]

Question:

Why did empty kernels A and B use 2 registers? CUDA always uses one implicit register, but why are 2 additional explicit registers used?

Kernel C is even more frustrating. 10 registers? But there are only 2 pointers. This gives 2*2 = 4 registers for the pointers. Even if there are additionally 2 mysterious registers (suggested by Kernel A and Kernel B), this would give 6 total. Still much less than 10 !

In case you are interested, here is the ptx code for Kernel A. The ptx code for Kernel B is exactly the same, modulo the integer values and variable names.

.visible .entry _Z8Kernel_Av(    
)
{           
        .loc 5 19 1
func_begin0:
        .loc    5 19 0

        .loc 5 19 1

func_exec_begin0:
        .loc    5 22 2
        ret;
tmp0:
func_end0:
}

And for Kernel C...

.weak .entry _Z35Kernel_CILh1EEvPhS0_(
        .param .u64 _Z35Kernel_CILh1EEvPhS0__param_0,
        .param .u64 _Z35Kernel_CILh1EEvPhS0__param_1
)
{
        .local .align 8 .b8     __local_depot2[16];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .s64       %rd<3>;


        .loc 5 38 1
func_begin2:
        .loc    5 38 0

        .loc 5 38 1

        mov.u64         %SPL, __local_depot2;
        cvta.local.u64  %SP, %SPL;
        ld.param.u64    %rd1, [_Z35Kernel_CILh1EEvPhS0__param_0];
        ld.param.u64    %rd2, [_Z35Kernel_CILh1EEvPhS0__param_1];
        st.u64  [%SP+0], %rd1;
        st.u64  [%SP+8], %rd2;
func_exec_begin2:
        .loc    5 836 2
tmp2:
        ret;
tmp3:
func_end2:
}

Why does it first declare a local-memory variable (.local) ?
Why are the two pointers (given as function arguments) stored in registers? Isn't there a special param space for them?
Perhaps the two function argument pointers belong in registers - that explains the two .reg .b64 lines. But what is the .reg .s64 line? Why is it there?

It gets worse still:

D:

template<uchar my_val>
__global__ void  Kernel_D
        (uchar *   device_prt_in, 
        uchar *const    device_prt_out)
{ 
    device_prt_in = device_prt_in + blockIdx.x*blockDim.x + threadIdx.x;
}

gives

ptxas info    : Used 6 registers, 48 bytes cmem[0]

So manipulating the argument (a pointer) decreases from 10 to 6 registers?

talonmies talonmies · Accepted Answer · 2013-06-20T16:09:58

The first point to make is that if you are worried about registers, don't look at PTX code, because it won't tell you anything. PTX uses static single assignment form and the code emitted by the compiler doesn't include any of the "decoration" required to make a runnable machine code entry point.

With that out of the way, let's look at kernel A:

$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z8Kernel_Av' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_Av
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 32 bytes cmem[0]

$ cuobjdump -sass null.cubin 

    code for sm_20
        Function : _Z8Kernel_Av
    /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
    /*0008*/     /*0x00001de780000000*/     EXIT;
        .............................

There are your two registers. Empty kernels don't produce zero instructions.

Beyond that, I can't reproduce what you have shown. If I look at your kernel C as posted, I get this (CUDA 5 release compiler):

$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z8Kernel_CILh1EEvPhS0_' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_CILh1EEvPhS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 48 bytes cmem[0]


$ cuobjdump -sass null.cubin 

code for sm_20
    Function : _Z8Kernel_CILh1EEvPhS0_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x00001de780000000*/     EXIT;
    ........................................

ie. identical 2 register code to the first two kernels.

and the same for Kernel D:

$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z8Kernel_DILh1EEvPhS0_' for 'sm_20'
ptxas info    : Function properties for _Z8Kernel_DILh1EEvPhS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 48 bytes cmem[0]

$ cuobjdump -sass null.cubin 
code for sm_20
    Function : _Z8Kernel_DILh1EEvPhS0_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x00001de780000000*/     EXIT;
    ........................................

Again, 2 registers.

For the record, the nvcc version I am using is:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_28_16:10:16_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

cuda - minimal example, high register usage

1 Answers