  1. Let's say your GPU has 8 SMs. If you execute a CUDA kernel with enough blocks (say 200), will all 8 SMs be used for the execution?

Now consider only a single SM. Let's assume there are 8 active blocks with 256 threads/block (8 warps/block). Max active warps = 64.

  2. Will the 8 active blocks be processed in parallel once the kernel starts?

  3. I know that the warps will be scheduled by the scheduler in each SM, which means warps will not execute in parallel but concurrently.


Here is my real problem: I am experiencing a latency issue with a particular kernel. The limiting factors are shown below. [screenshot: the kernel's limiting factors from the profiler] I just want to know the optimal adjustments for this case, because if the active blocks do not execute at least concurrently, there is no point in increasing the active block count; having a minimum number of active blocks with 64 active warps will perform better (ignore the register limitation, as I can adjust it accordingly).
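
For reference, here is a minimal sketch of how the relevant limits can be queried at runtime with the CUDA occupancy API; the `dummy` kernel is a hypothetical stand-in, not the kernel from the question:

    // Query device limits and per-SM residency for a 256-thread block.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * i;  // trivial work
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, max threads/SM: %d (= %d warps)\n",
               prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor,
               prop.maxThreadsPerMultiProcessor / prop.warpSize);

        int blockSize = 256;  // 8 warps/block, as in the question
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummy, blockSize, 0 /* dynamic shared mem */);
        printf("resident blocks/SM at %d threads/block: %d\n",
               blockSize, blocksPerSM);

        // With 200 blocks launched on an 8-SM GPU, every SM receives blocks;
        // each SM holds up to blocksPerSM resident blocks at a time, and the
        // remaining blocks wait until resident ones retire.
        return 0;
    }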

1. Yes. 2. It depends on the kernel, the resources it requires, and what is available on your GPU. 3. Up to 2048 threads may execute concurrently, depending on data availability, but not all instructions execute at the same time: the executable instructions of each warp are interleaved. Think of the feasibility of __syncthreads(). - Florent DUGUET
Your scenario related to (2) and (3) isn't possible when running a single kernel. - talonmies
@talonmies Why did you say that the 2nd and 3rd aren't possible? - Akila D. Perera
@Florent DUGUET I accept the answer for 1 and 3. But in the 2nd I tried to say that there are 12 active blocks, which means there are enough registers, shared memory, and maximum thread count to have 12 active blocks. I want to know whether those 12 blocks also execute concurrently like warps, or in parallel like SMs. - Akila D. Perera
Change (2) to 8 blocks of 8 warps/block or 12 blocks of 4 warps/block; the rest is covered in my answer below. - Greg Smith

1 Answer


Assuming all resource constraints are met, all blocks/warps will be resident on the SM at the same time (compute capability 3.0-7.0), and each of the SM's 4 warp schedulers will be allocated 1/4 of the warps. On each cycle a warp scheduler picks the most eligible active warp and issues 1-2 instructions (depending on the architecture). The maximum instruction issue parallelism for one SM is therefore 4 warps; the maximum warp parallelism for instructions in flight is the SM limit of 64 warps.
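
Putting concrete numbers on this for the configuration in the question: 8 blocks x (256 threads / 32 threads per warp) = 64 resident warps, which is exactly the SM limit. With 4 schedulers, each manages 16 of those warps and issues from at most one of them per cycle, so at most 4 warps issue instructions in any given cycle while all 64 remain in flight.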

The optimal number of warps per SM varies with the instruction mix, resource requirements, and memory access patterns. The profiler can be used to determine whether the configuration has sufficient warps to hide latency. Increasing the number of resident warps reduces the registers available per thread but increases the potential for latency hiding. Increasing warps per block can increase data sharing between warps, but it can also result in lower achieved occupancy if the kernel has a tail effect, or in fewer eligible warps if barriers are heavily used; in those cases, reducing the warps per block is recommended. If the kernel is not using shared memory, then smaller block sizes (e.g., 256 threads/block) are recommended.
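
As a hedged sketch of one way to act on this advice, the CUDA runtime can suggest an occupancy-maximizing block size for a given kernel; `scale` below is a hypothetical example kernel, not the one from the question:

    // Ask the runtime for a block size that maximizes occupancy, then launch.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        int minGridSize = 0, blockSize = 0;
        // Returns the block size with the highest theoretical occupancy and
        // the minimum grid size needed to saturate the device.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
        printf("suggested block size: %d, min grid size: %d\n",
               blockSize, minGridSize);

        int gridSize = (n + blockSize - 1) / blockSize;
        scale<<<gridSize, blockSize>>>(d_data, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }

Whether the suggested size actually helps still has to be verified empirically, e.g. by comparing achieved occupancy and runtime in the profiler (nvprof --metrics achieved_occupancy on older toolkits, or Nsight Compute's occupancy section).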