0
votes

On a compute capability 1.3 CUDA card, we run the following code:

for (int i = 1; i < 20; ++i)

    kernelrun<<<30, 320>>>(...);

We know that each SM has 8 SPs and can run up to 1024 threads, and the Tesla C1060 has 30 SMs, so it can run 30*1024 threads concurrently.

As per the given code, how many threads can run concurrently? If the kernelrun kernel uses 48 registers per thread, what are the limitations on the Tesla C1060, which has 16384 registers and 16 KB of shared memory per SM?

Since concurrent kernel execution is not supported on the Tesla C1060, how can we execute the kernels in the loop concurrently? Are streams possible, given that the C1060 has only one copy engine and one execute engine?
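One hedged sketch of what streams could look like here: on a C1060 the kernels themselves still serialize on the single compute engine, but the single copy engine can overlap one asynchronous transfer with a running kernel. The buffer names `h_in`/`d_in` and the size `bytes` below are hypothetical stand-ins for whatever `kernelrun` actually consumes, and the host buffers are assumed to be allocated with `cudaMallocHost` (pinned memory), which async copies require.

```cuda
// Sketch: double-buffered loop using two streams. Kernels still run one
// at a time on a C1060, but the H2D copy for iteration i can overlap the
// kernel from iteration i-1.
const int NSTREAMS = 2;
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamCreate(&streams[s]);

for (int i = 1; i < 20; ++i) {
    int s = i % NSTREAMS;
    // h_in[s] must be pinned host memory for the copy to be truly async
    cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                    cudaMemcpyHostToDevice, streams[s]);
    kernelrun<<<30, 320, 0, streams[s]>>>(d_in[s]);
}
cudaDeviceSynchronize();
```

Whether this helps depends entirely on how much of each iteration's time is copying versus computing; if the loop launches kernels with no intervening transfers, streams buy nothing on this hardware.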

1
Just to point out: you have 30 SMs and each SM has 8 SPs. This means you can run only 240 threads concurrently; the rest will be queued. Period - Programmer
@talonmies: Why is the above plain wrong? - Programmer
@Programmer: CUDA works like a SIMD architecture with a vector size equal to the warp size, i.e. 32. A compute 1.x device single-issues one warp at a time per SM, and the architecture is pipelined (about 21 stages). So at any given time a GT200 could have 32 threads per SM on the ALUs, retiring at a maximum rate of one warp every fourth clock, plus more threads doing data prefetch, plus even more reading the register file. The total number of threads in some stage of execution simultaneously could be up to 128 warps per SM - talonmies

1 Answer

1
votes

NVIDIA has been shipping an Occupancy Calculator, which you can use to answer this question for yourself, since 2007. You should try it.

But to answer your question: each SM in your compute 1.3 device has 16384 registers, so if your kernel is register-limited at 48 registers per thread, the number of threads per block would be roughly 320 (16384/48, rounded down to the nearest multiple of 32). There is also a register allocation granularity to consider.