2
votes

The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill sotes, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]

Launch with: kernelA<<<20,512>>>(float parmA, int paramB); and it will run fine.
Launch with: kernelA<<<20,513>>>(float parmA, int paramB); and it get the out of resources error. (too many resources requested for launch).

The Fermi device properties: 48KB of shared mem per SM, constant mem 64KB, 32K registers per SM, 1024 maximum threads per block, comp capable 2.1 (sm_21)

I'm using all my shared mem space. I'll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.

  1. Why can't I use a full block of threads (1024)?
  2. Any guess as to which resource I'm running out of?
  3. I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
1
What hardware is this running on? And what CUDA toolkit version?talonmies
HW: properties.name= Quadro 20000M Toolkit: nvcc -version = Cuda compilation tools, release 4.1, V0.2.1221Doug
It doesn't seem to matter how many blocks, or how much shared mem I use.Doug
The number of PTX registers is not equal to the number of registers used by the launch. It appears this information can no longer be easily captured using the binutils. Nsight Visual Studio CUDA Trace Activity, the CUDA profiler, and the Visual Profiler will display the actual registers per thread, static shared memory per block, and dynamic shared memory per block. I need to see if this information is still output if the launch fails to launch due to resource limitation check.Greg Smith

1 Answers

2
votes

When I did the reg count, I commented out the printf's. Reg count= 45
When it was running, it had the printf's coded. Reg count= 63 w/plenty of spill "reg's".
I suspect each thread really has 64 reg's, with only 63 available to the program.
64 reg's * 512 threads = 32K - The maximum available to a single block.

So I suggest the # of available "code" reg's to a block = cudaDeviceProp::regsPerBlock - blockDim i.e. The kernel doesn't have access to all 32K registers. The compiler currently limits the # of reg's per thread to 63, (or they spill over to lmem). I suspect this 63, is a HW addressing limitation.

So it looks like I'm running out of register space.