18
votes

I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o file with

    nvcc -arch=sm_20 -ptxas-options=-v

and I got the following output (passed through c++filt):

ptxas info    : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info    : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
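(For reference, the full compile line, reconstructed from the file names above, would look something like the following; `-Xptxas -v` is an equivalent spelling of `--ptxas-options=-v`.)

```shell
# Reconstructed compile command (file names assumed from the question).
# --ptxas-options=-v (equivalently -Xptxas -v) makes ptxas print
# per-kernel register, stack, spill and constant-memory usage.
nvcc -arch=sm_20 --ptxas-options=-v -c kernel.cu -o kernel.o 2>&1 | c++filt
```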

Looking at the output above, is it correct to say that

  • each CUDA thread is using 46 registers?
  • there is no register spilling to local memory?

I am also having some issues with understanding the output.

  • My kernel is calling a whole lot of __device__ functions. Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions?

  • What is the difference between 0 bytes spill stores and 0 bytes spill loads?

  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?

This question is "continued" in: Interpreting the verbose output of ptxas, part II

'Used 46 registers' indicates the compiler has reserved 46 registers per thread for the compiled kernel, and the other registers are spilled. You can find the number of spilled registers by subtracting this number (46) from the total number of registers used in the kernel's PTX. – lashgar
@Ahmad: Your first sentence is correct, but the second is not. A kernel can use less than the maximum permissible registers per thread and have no spills to local memory. – talonmies
To elaborate on talonmies' reply, PTX is a high-level abstraction with infinite registers. That's because it can be compiled for multiple generations of GPU, and the number of registers can be different. It's only when you compile down to the machine-specific code that you can really look at register use. In any case, ptxas (compiling PTX to the machine-specific code) tells you the amount of spills. – Tom
The compiler also uses constant memory for numeric constants (if they are too large to be hardcoded in the instruction opcode), though I am not sure why cmem is repeated twice if you say that you do not use constant memory yourself in the program. – user1545642
@curiousexplorer: I've posted a "part II" for this question. – einpoklum

1 Answer

16
votes
  • Each CUDA thread is using 46 registers? Yes, correct
  • There is no register spilling to local memory? Yes, correct
  • Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct
  • What is the difference between 0 bytes spill stores and 0 bytes spill loads?
    • Fair question. The number of loads could be greater than the number of stores, since you could spill a computed value, load it once, discard it (i.e. store something else into that register), then load it again (i.e. reuse it). Update: note also that the spill load/store counts are based on static analysis, as described by @njuffa in the comments below
  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
    • Constant memory is used for a few purposes, including __constant__ variables and kernel arguments. Different "banks" are used; that starts to get a bit detailed, but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments you will be OK.
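To make that last point concrete, here is a minimal sketch (a hypothetical kernel, not the one from the question) of why ptxas reports cmem usage even when no __constant__ variables are declared: on sm_2x devices the kernel's arguments themselves are passed through a constant bank (cmem[0]), and the compiler may also place large literal constants in constant memory.

```cuda
// Hypothetical example (not the question's kernel). No __constant__
// variables are declared, yet compiling with
//     nvcc -arch=sm_20 --ptxas-options=-v scale.cu
// still reports "bytes cmem[0]": on sm_2x the kernel parameters
// (out, n, scale) are passed to the device via constant bank 0.
__global__ void scaleKernel(double *out, int n, double scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // A literal too large to encode directly in the instruction
        // may also be placed in constant memory by the compiler.
        out[i] = scale * 1.4142135623730951;
}
```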