CUDA parallel execution

0

votes

Can somebody enlighten me on this: are blocks executed in parallel (concurrently) in CUDA? In other words, if two different blocks try to write to the same global address, e.g. globalPtr[12], is there a lost-update problem?

(I am asking because I have read that the unit of parallel execution in CUDA is the warp, i.e. 32 threads.)

cuda

2

votes

Yes, multiple blocks execute in parallel, so updates to global memory need to be atomic if more than one thread accesses the same address and at least one of them writes to it. This applies whether it's two threads within the same block or two threads in different blocks.
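To make the lost-update problem concrete, here is a minimal sketch, assuming a CUDA-capable device and the CUDA runtime API. The kernel names and the helper `countWith` are hypothetical, chosen only for illustration. Every thread in every block increments the same global counter, once with a plain read-modify-write and once with `atomicAdd`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Unsynchronized read-modify-write: two threads can read the same old value,
// and one of the two increments is then lost.
__global__ void racyIncrement(int *counter) {
    *counter = *counter + 1;
}

// atomicAdd performs the read-modify-write as one indivisible operation.
__global__ void atomicIncrement(int *counter) {
    atomicAdd(counter, 1);
}

// Hypothetical helper: launch one of the two kernels and return the final count.
// Returns -1 if the device allocation fails.
int countWith(bool useAtomic, int blocks, int threads) {
    int *dCounter = nullptr;
    int result = -1;
    if (cudaMalloc(&dCounter, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dCounter, 0, sizeof(int));
    if (useAtomic) {
        atomicIncrement<<<blocks, threads>>>(dCounter);
    } else {
        racyIncrement<<<blocks, threads>>>(dCounter);
    }
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dCounter);
    return result;
}

int main() {
    const int blocks = 256, threads = 256;
    printf("expected: %d\n", blocks * threads);
    printf("racy:     %d\n", countWith(false, blocks, threads));
    printf("atomic:   %d\n", countWith(true, blocks, threads));
    return 0;
}
```

The atomic version should report exactly `blocks * threads`. The racy version typically reports far less, and the exact value can differ from run to run and from device to device.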

2

votes

Yes, you can get parallel execution between multiple blocks if the CUDA device has multiple warp schedulers.

CUDA devices with compute capability 2.1 have two warp schedulers per multiprocessor, so instructions from two different warps (from the same block or from different blocks, it doesn't matter) can execute concurrently.

CUDA devices with compute capability 3.0 have four warp schedulers per multiprocessor, and each scheduler can issue two independent instructions per warp that is ready to execute.

Note that even without concurrent execution between warps, it is advantageous to have multiple blocks available to the scheduler so that if a warp is blocked waiting for a memory operation to complete, the scheduler can switch to another warp for execution so the cores don't sit idle.

The number of warps that can be resident on a multiprocessor, ready for the scheduler to switch to, varies by compute capability.
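The limits for a particular device can be read at run time. The following is a sketch, assuming device 0 is the GPU of interest; the helper name `maxResidentWarpsPerSM` is hypothetical. It derives the resident-warp limit from the `maxThreadsPerMultiProcessor` and `warpSize` fields of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: maximum number of warps that can be resident on one
// multiprocessor of the given device, or -1 if the query fails.
int maxResidentWarpsPerSM(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;
}

int main() {
    const int device = 0;  // assumption: the first GPU is the one of interest
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("device:                 %s\n", prop.name);
    printf("compute capability:     %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors:        %d\n", prop.multiProcessorCount);
    printf("warp size:              %d\n", prop.warpSize);
    printf("max resident warps/SM:  %d\n", maxResidentWarpsPerSM(device));
    return 0;
}
```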

If you only define as many blocks as you have schedulers, you will not be able to achieve the full compute potential of your device. This is particularly true if your code does a lot of memory I/O - one way to "hide" memory latency is to make sure there are enough blocks/warps available so the scheduler(s) always have a ready warp to switch to when one of the warps stalls waiting on a memory operation.
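One way to give the schedulers enough work is to size the grid from what the device can keep resident and let each thread loop over the data. This is only a sketch under stated assumptions: the kernel `scale` and the helper `chooseGridSize` are hypothetical names, and the occupancy query is one possible heuristic, not the only way to pick a launch configuration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: the kernel is correct for any grid size, so the grid can
// be chosen to fill the device rather than to match the data size.
__global__ void scale(float *data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical helper: number of blocks of `scale` that can be resident on
// the whole device at once for the given block size.
int chooseGridSize(int blockSize) {
    int device = 0, smCount = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0);
    return smCount * blocksPerSM;
}

int main() {
    const int n = 1 << 20, blockSize = 256;
    std::vector<float> host(n, 1.0f);
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(dData, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int gridSize = chooseGridSize(blockSize);
    if (gridSize < 1) gridSize = 1;  // fall back to a single block
    scale<<<gridSize, blockSize>>>(dData, n, 2.0f);

    cudaMemcpy(host.data(), dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("launched %d blocks of %d threads, host[0] = %f\n",
           gridSize, blockSize, host[0]);
    cudaFree(dData);
    return 0;
}
```

With several blocks resident per multiprocessor, a scheduler has other warps to run while one warp waits on a load from `data`.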

Whenever you have multiple warps reading and writing the same memory address, you should use atomic operations or take a lock, regardless of whether your current hardware can execute multiple warps concurrently. Write-after-write artifacts ("lost updates") can manifest even in task-switched, single-core execution.
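When no built-in atomic fits the update, a common pattern is a compare-and-swap retry loop built on `atomicCAS`. The sketch below assumes that pattern; `atomicMaxFloat`, `proposeMax` and `maxOverThreads` are hypothetical names for illustration, not CUDA API functions. Every thread proposes its global index as a candidate maximum for one shared address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Atomic max for float built from atomicCAS: retry until either the stored
// value is already >= value or our swap succeeds without interference.
__device__ float atomicMaxFloat(float *addr, float value) {
    int *addrAsInt = reinterpret_cast<int *>(addr);
    int old = *addrAsInt, assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value) break;  // nothing to update
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
    } while (assumed != old);  // another thread changed it first: retry
    return __int_as_float(old);
}

// Every thread, in every block, writes to the same global address.
__global__ void proposeMax(float *result) {
    float mine = static_cast<float>(blockIdx.x * blockDim.x + threadIdx.x);
    atomicMaxFloat(result, mine);
}

// Hypothetical helper: returns the maximum proposed by all threads,
// or -1.0f if the device allocation fails.
float maxOverThreads(int blocks, int threads) {
    float *dResult = nullptr;
    float result = -1.0f;
    if (cudaMalloc(&dResult, sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemcpy(dResult, &result, sizeof(float), cudaMemcpyHostToDevice);
    proposeMax<<<blocks, threads>>>(dResult);
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dResult);
    return result;
}

int main() {
    const int blocks = 64, threads = 256;
    printf("max = %.0f (largest index is %d)\n",
           maxOverThreads(blocks, threads), blocks * threads - 1);
    return 0;
}
```

A retry loop like this avoids holding a lock, which matters on a GPU because threads of one warp spinning on a lock held by another thread of the same warp can deadlock on some hardware.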

