2 votes

How many resident warps are present per SM in the Tegra K1 (GK20a GPU)?

As per the documents, I got the following information: the Tegra K1 has 1 SMX and 192 cores per multiprocessor.

Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024

Can anyone specify the value of maximum blocks per SMX?

Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If not, how many threads run concurrently?

Kindly help me solve and understand this.


2 Answers

4 votes

Can anyone specify the value of maximum blocks per SMX?

The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.

Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If not, how many threads run concurrently?

There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".

  1. Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).

  2. Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).

  3. Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.

So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.

However, the maximum thread-instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (max) 2 instructions = 8 warp-instructions, i.e. 8 * 32 = 256 thread-instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may be achievable.
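For reference, a minimal sketch (assuming a standard CUDA toolkit and the runtime API) that queries these per-multiprocessor limits on whatever device is present, including the GK20a in a Tegra K1; the resident-warp count is just the maximum resident threads divided by the warp size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Hardware limits reported by the runtime.
    std::printf("Device:                     %s (cc %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("Multiprocessors:            %d\n", prop.multiProcessorCount);
    std::printf("Warp size:                  %d\n", prop.warpSize);
    std::printf("Max threads per SM:         %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Max threads per block:      %d\n", prop.maxThreadsPerBlock);
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    std::printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);

    // Maximum resident warps per SM = max resident threads / warp size
    // (2048 / 32 = 64 on Kepler).
    std::printf("Max resident warps per SM:  %d\n",
                prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```

On a cc 3.x device this should report 32-thread warps and 2048 maximum resident threads per multiprocessor, i.e. 64 resident warps.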

0 votes

Each block deployed for execution to an SM requires certain resources, either registers or shared memory. Let's imagine the following situation:

  • each thread of a certain kernel uses 64 32-bit registers (256 B of register storage per thread),
  • the kernel is launched with blocks of 1024 threads,
  • obviously such a block would consume 256 * 1024 B of registers on a particular SM

I don't know about the Tegra, but in the case of the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (256 kB) available, therefore in this scenario all of the registers would get used up by a single block deployed to that SM, so the limit of blocks per SM would be 1 in this case...

The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB, then two blocks could be deployed to an SM with 64 kB of shared memory. Worth mentioning is that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
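To illustrate that launch parameter, a hedged sketch (the kernel, launch shape, and the 32 kB figure are made up for this example): the third launch-configuration argument is the dynamic shared memory requested per block, and it is this requested amount, not what the kernel actually touches, that counts against the per-SM shared memory budget.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel; the extern __shared__ array is sized at launch time
// through the third launch-configuration argument.
__global__ void dummyKernel(float *out) {
    extern __shared__ float tile[];              // dynamic shared memory
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const int blockSize = 256, gridSize = 64;    // assumed launch shape
    float *d_out = nullptr;
    cudaMalloc(&d_out, gridSize * blockSize * sizeof(float));

    // Request 32 kB of dynamic shared memory per block; the request (not the
    // amount the kernel actually uses) is what limits blocks per SM.
    size_t sharedBytes = 32 * 1024;
    dummyKernel<<<gridSize, blockSize, sharedBytes>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```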

I am not sure at the moment whether there is some blocking factor other than registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory it is 2, then the lower number is the limit on the number of blocks per SM.
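As a rough sketch of that reasoning (the per-thread register count, block size, and shared memory figure are the assumed values from above; register allocation granularity is ignored, and regsPerMultiprocessor / sharedMemPerMultiprocessor are fields reported by reasonably recent CUDA runtimes), the per-resource limits and their minimum can be computed from the device properties:

```cpp
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumed per-block usage, mirroring the example above:
    // 64 registers per thread, 1024 threads per block, 32 kB of shared memory.
    const int regsPerThread    = 64;
    const int threadsPerBlock  = 1024;
    const size_t smemPerBlock  = 32 * 1024;

    // Blocks each resource allows to be resident at once (allocation
    // granularity is ignored here, so real occupancy may be slightly lower).
    int regLimit    = prop.regsPerMultiprocessor / (regsPerThread * threadsPerBlock);
    int smemLimit   = static_cast<int>(prop.sharedMemPerMultiprocessor / smemPerBlock);
    int threadLimit = prop.maxThreadsPerMultiProcessor / threadsPerBlock;

    // The effective residency is the smallest of the individual limits
    // (the hardware cap of 16 blocks per Kepler SM also still applies).
    int blocksPerSM = std::min({regLimit, smemLimit, threadLimit});

    std::printf("register-limited blocks per SM:      %d\n", regLimit);
    std::printf("shared-memory-limited blocks per SM: %d\n", smemLimit);
    std::printf("thread-limited blocks per SM:        %d\n", threadLimit);
    std::printf("resulting blocks per SM:             %d\n", blocksPerSM);
    return 0;
}
```

For a real compiled kernel, the runtime can perform this kind of calculation directly via cudaOccupancyMaxActiveBlocksPerMultiprocessor.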

As for your second question, how many threads can run concurrently: the answer is as many as there are cores in one SM, so in the case of an SMX on the Kepler architecture it is 192. The number of concurrent warps is obviously 192 / 32 = 6.

If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.

EDIT: Reading Robert Crovella's answer, I realized there really are these limits for blocks per SM and threads per SM, but I was never able to reach them, as my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices, but such info can also be found, for example in the case of the GK110 chip, in the related document on NVIDIA's pages.