20
votes

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.

I have a GTX 550Ti, btw with compute capability 2.1. 4 SMs x 48 cores = 192 CUDA cores.

Ok so what's unclear to me is:

Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to a SM, but nothing as to how they're ran. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads on them. Also if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...

Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel aswell? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?

On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).

I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!

3

3 Answers

18
votes

The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.

When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.

In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.

Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issue instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.

Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.

Answers to each Question

Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?

Yes, the maximum number is based upon the compute capability of the device. See Tabe 10. Technical Specifications per Compute Capability : Maximum number of residents blocks per multiprocessor to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.

Q:From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?).

The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow developer more flexibility in how they partition work.

Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...

You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.

Q: Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel as well?

Yes, warps can run in parallel.

Q: Because in the Fermi architecture it states that 2 warps are executed concurrently

Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.

Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?

Yes. CUDA cores is the number of integer and floating point execution units. The SM has other types of execution units which I listed above. The GTX550 is a CC 2.1 device. On each cycle a SM has the potential to dispatch at most 4 instructions (128 threads) per cycle. Depending on the definition of execution the total threads in flight per cycle can range from many hundreds to many thousands.

1
votes

I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.

In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.

-2
votes

One of the concepts that took a whle to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.

Consequently, a context-switch occurs on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on theri memory accesses. ne of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiples cores.

Best performance is achieved when no core is ever waiting on a memory access, and is achieved by having just enough contexts to ensure this happens.