3 votes

In all the papers I am reading, I see that the GPU is made up of multiprocessors, and each multiprocessor has 8 processors which are capable of executing a single warp in parallel.
The GPU I am using is an Nvidia 560; it has only 7 multiprocessors but 48 processors in each multiprocessor. Does this mean that every multiprocessor in the Nvidia 560 is able to execute 6 warps in parallel?
Can I say that the max number of threads executed in parallel on the Nvidia 560 is 32*6*7 = 1344 threads? (32 = warp size, 7 = multiprocessors, 6 = warps executed in parallel)

How many multiprocessors are in the fastest Nvidia GPU? Which GPU is that? What is the maximum amount of global memory the biggest GPU has?
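
For reference, this is roughly how I read those numbers off my card, using the standard cudaGetDeviceProperties call (the cores-per-SM figure is not queryable, so the 48 comes from the spec sheet):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("Name:            %s\n", prop.name);
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);  // 7 on my card
        printf("Warp size:       %d\n", prop.warpSize);             // 32
        printf("Global memory:   %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }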


2 Answers

3 votes

From CUDA Programming Guide 4.2:

[...] at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

So, the maximum number of concurrently running warps per SM equals the number of warp schedulers.

Your GeForce 560 has compute capability 2.1:

For devices of compute capability 2.x, a multiprocessor consists of: [...] 2 warp schedulers

This means each SM of your GPU can run 2 warps = 64 threads concurrently, making 448 threads in total across the 7 SMs. Note, however, that it is highly recommended to use many more threads than that:

The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely “hidden”.
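
As a hedged illustration (the saxpy kernel and the names n, d_x, d_y below are made-up placeholders, not from the guide): launching many blocks of several warps each gives the schedulers a pool of ready warps to pick from while other warps wait on memory:

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // the memory latency here is hidden by other warps
    }

    // 256 threads = 8 warps per block; with several blocks resident per SM,
    // the two schedulers almost always find a warp that is ready to issue.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);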

Regarding your other questions: the GeForce GTX 690 has 3072 CUDA cores. However, to CUDA it looks like two separate GPUs with 1536 cores each, so it is no better than two GeForce GTX 680s, and the latter is easily overclocked, judging by numerous online reviews. The largest memory among GPUs is installed in the Nvidia Tesla M2090: 6 GiB of GDDR5 (512 CUDA cores). I guess a new family of Teslas, based on the Kepler architecture like the GeForce 6xx series, will be released soon, but I have not heard of any official announcements.
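
If it helps, the "two separate GPUs" behaviour is directly visible from the runtime API; a quick sketch (names and sizes will vary, and I don't have a GTX 690 at hand to verify the exact output):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);   // a GTX 690 should report 2 here
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, %zu MiB of global memory\n",
                   d, prop.name, prop.totalGlobalMem >> 20);
        }
        return 0;
    }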

2 votes

The papers you are reading are old. The first two generations of CUDA GPUs had 8 cores per MP and issued instructions from a single warp (to simplify, each instruction gets executed four times on 8 cores to service a single 32-thread warp).

The Fermi card you have is newer and different. It "dual-issues" instructions from two different warps per multiprocessor (so each warp instruction is executed twice on 16 cores). When the code stream allows it, an additional instruction from one of those two warps can be issued onto the remaining 16 cores, i.e. a limited form of out-of-order execution. This latter feature is only available on compute capability 2.1 devices; on compute capability 2.0 devices, there are only 32 cores per multiprocessor. But the number of warps retiring instructions per multiprocessor on any given shader clock cycle is two in both cases. Note that there is a rather deep instruction pipeline, so there is considerable latency between issue and retirement, and up to 48 warps can be active per multiprocessor at any instant in time.

So your answer is either 14 warps or 336 warps across the 7 multiprocessors in your GTX 560, depending on which definition of "executed in parallel" you wish to adopt. The information I used to answer this mostly comes from Appendix F of the current Programming Guide.
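
To make the two counts concrete, here is a sketch of how both numbers fall out of the device properties (the 2-schedulers-per-SM constant is from the Programming Guide for compute capability 2.x and is not queryable at runtime):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int schedulersPerSM = 2;   // fixed for compute capability 2.x per the Programming Guide
    int issuing  = prop.multiProcessorCount * schedulersPerSM;   // 7 * 2 = 14 warps issuing per cycle
    int resident = prop.multiProcessorCount *
                   prop.maxThreadsPerMultiProcessor / prop.warpSize;   // 7 * 1536 / 32 = 336 resident warps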