2 votes

How many resident warps are present per SM in the Tegra K1 (GK20a GPU)?

As per the documents, I got the following information: the Tegra K1 has 1 SMX and 192 cores per multiprocessor.

Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024

Can anyone specify the value of maximum blocks per SMX?

Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If not, how many threads run concurrently?

Kindly help me solve and understand this.


2 Answers

4 votes

Can anyone specify the value of maximum blocks per SMX?

The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.

Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If not, how many threads run concurrently?

There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".

  1. Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).

  2. Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).

  3. Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.

So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.

However, the maximum thread-instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (max) 2 instructions = 8 warp-instructions, i.e. 8 * 32 = 256 thread-instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may be achievable.
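For reference, a minimal sketch (assuming a standard CUDA toolkit and the runtime API) that queries these per-multiprocessor limits on whatever device is present, including the GK20a in a Tegra K1; the resident-warp count is just the maximum resident threads divided by the warp size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Hardware limits reported by the runtime.
    std::printf("Device:                     %s (cc %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("Multiprocessors:            %d\n", prop.multiProcessorCount);
    std::printf("Warp size:                  %d\n", prop.warpSize);
    std::printf("Max threads per SM:         %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Max threads per block:      %d\n", prop.maxThreadsPerBlock);
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    std::printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);

    // Maximum resident warps per SM = max resident threads / warp size
    // (2048 / 32 = 64 on Kepler).
    std::printf("Max resident warps per SM:  %d\n",
                prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```

On a cc 3.x device this should report 32-thread warps and 2048 maximum resident threads per multiprocessor, i.e. 64 resident warps.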

0 votes

Each block deployed for execution to an SM requires certain resources, either registers or shared memory. Let's imagine the following situation:

  • each thread of a certain kernel uses 64 32-bit registers (256 B of register storage per thread),
  • the kernel is launched with blocks of 1024 threads,
  • obviously such a block would consume 256 * 1024 B of registers on a particular SM

I don't know about the Tegra, but in the case of the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (256 kB) available, therefore in this scenario all of the registers would get used up by a single block deployed to that SM, so the limit of blocks per SM would be 1 in this case...

The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB, then two blocks could be deployed to an SM with 64 kB of shared memory. Worth mentioning is that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
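To illustrate that launch parameter, a hedged sketch (the kernel, launch shape, and the 32 kB figure are made up for this example): the third launch-configuration argument is the dynamic shared memory requested per block, and it is this requested amount, not what the kernel actually touches, that counts against the per-SM shared memory budget.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel; the extern __shared__ array is sized at launch time
// through the third launch-configuration argument.
__global__ void dummyKernel(float *out) {
    extern __shared__ float tile[];              // dynamic shared memory
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const int blockSize = 256, gridSize = 64;    // assumed launch shape
    float *d_out = nullptr;
    cudaMalloc(&d_out, gridSize * blockSize * sizeof(float));

    // Request 32 kB of dynamic shared memory per block; the request (not the
    // amount the kernel actually uses) is what limits blocks per SM.
    size_t sharedBytes = 32 * 1024;
    dummyKernel<<<gridSize, blockSize, sharedBytes>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```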

I am not sure at the moment whether there is some blocking factor other than registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory it is 2, then the lower number is the limit on the number of blocks per SM.
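As a rough sketch of that reasoning (the per-thread register count, block size, and shared memory figure are the assumed values from above; register allocation granularity is ignored, and regsPerMultiprocessor / sharedMemPerMultiprocessor are fields reported by reasonably recent CUDA runtimes), the per-resource limits and their minimum can be computed from the device properties:

```cpp
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumed per-block usage, mirroring the example above:
    // 64 registers per thread, 1024 threads per block, 32 kB of shared memory.
    const int regsPerThread    = 64;
    const int threadsPerBlock  = 1024;
    const size_t smemPerBlock  = 32 * 1024;

    // Blocks each resource allows to be resident at once (allocation
    // granularity is ignored here, so real occupancy may be slightly lower).
    int regLimit    = prop.regsPerMultiprocessor / (regsPerThread * threadsPerBlock);
    int smemLimit   = static_cast<int>(prop.sharedMemPerMultiprocessor / smemPerBlock);
    int threadLimit = prop.maxThreadsPerMultiProcessor / threadsPerBlock;

    // The effective residency is the smallest of the individual limits
    // (the hardware cap of 16 blocks per Kepler SM also still applies).
    int blocksPerSM = std::min({regLimit, smemLimit, threadLimit});

    std::printf("register-limited blocks per SM:      %d\n", regLimit);
    std::printf("shared-memory-limited blocks per SM: %d\n", smemLimit);
    std::printf("thread-limited blocks per SM:        %d\n", threadLimit);
    std::printf("resulting blocks per SM:             %d\n", blocksPerSM);
    return 0;
}
```

For a real compiled kernel, the runtime can perform this kind of calculation directly via cudaOccupancyMaxActiveBlocksPerMultiprocessor.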

As for your second question, how many threads can run concurrently: the answer is as many as there are cores in one SM, so in the case of an SMX on the Kepler architecture it is 192. The number of concurrent warps is obviously 192 / 32 = 6.

If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.

EDIT: Reading Robert Crovella's answer, I realized there really are these limits for blocks per SM and threads per SM, but I was never able to reach them, as my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices, but such info can also be found, for example in the case of the GK110 chip, in the related document on NVIDIA's pages.