1 vote

I've spent a month searching for a solution to this problem: I cannot synchronize blocks in CUDA.

I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array so that each block could write to one element of it. After writing, one thread of each block waits (i.e. spins in a while loop) until all blocks have written to the global array.

When I use 3 blocks my synchronization works well (because I have 3 SMs). But using only 3 blocks gives me 12% occupancy, so I need more blocks, and then they can't be synchronized. The problem is that a block resident on an SM waits for the other blocks, so the SM can never take in another block.

What can I do? How can I synchronize blocks when there are more blocks than SMs?

GPU specification: compute capability 6.1, 3 SMs, Windows 10, VS2015, GeForce MX150 graphics card. Please help me with this problem. I've tried a lot of code but none of it works.
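For reference, a minimal sketch of the spin-wait scheme described above (names and sizes are illustrative, not from the question's actual code). It shows exactly why the approach deadlocks once the grid has more blocks than can be resident at once:

```cuda
// Sketch of the global-array / spin-wait "barrier" described above.
// WARNING: this deadlocks when the grid has more blocks than can be
// resident on the device, because the spinning blocks never yield
// their SMs to the blocks that have not been scheduled yet.

__device__ unsigned int g_arrived = 0;  // how many blocks have checked in

__global__ void spinWaitBarrier(unsigned int numBlocks)
{
    // One thread per block registers the block's arrival.
    if (threadIdx.x == 0)
        atomicAdd(&g_arrived, 1u);
    __syncthreads();

    // Spin until every block has arrived. If some blocks are still
    // waiting to be scheduled onto an SM, this loop never terminates.
    if (threadIdx.x == 0) {
        while (atomicAdd(&g_arrived, 0u) < numBlocks) { /* spin */ }
    }
    __syncthreads();

    // ... work that assumed all blocks were synchronized ...
}
```

With 3 blocks on 3 SMs every block is resident from the start, so the counter eventually reaches `numBlocks` and the loop exits; with more blocks than resident slots, the waiting blocks occupy all SMs forever.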

You are right - you cannot synchronise blocks in CUDA. - tera
I can, but only when the number of SMs and the number of blocks are equal. It doesn't make sense that there is no way. I need it. - pedram64
It makes perfect sense. The architecture and programming model are basically incapable of this sort of synchronization. If that doesn't work for you, then you either need a different algorithm, or you need to use a different sort of parallel hardware. Just because you need something, or think its absence doesn't make sense, doesn't automatically make it possible - talonmies
What about dynamic parallelism? - pedram64
What about it ? - talonmies

1 Answer

4 votes

The CUDA programming model offers two methods for inter-block synchronization:

  1. (implicit) Use the kernel launch itself. Before a kernel launch, and after it completes, all blocks (in the launched kernel) are synchronized to a known state. This is conceptually true whether the kernel is launched from host code or as part of a CUDA Dynamic Parallelism launch.

  2. (explicit) Use a grid sync in CUDA Cooperative Groups. This has a variety of requirements for support, which you are starting to explore in your other question. The simplest definition of support is whether the appropriate device property (cooperativeLaunch) is set. You can query the property programmatically using cudaGetDeviceProperties().
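Method 1 can be sketched as follows: split the work at the synchronization point into two kernels. Kernel names, data, and launch dimensions here are placeholders:

```cuda
// Sketch of method 1: the kernel launch boundary acts as a grid-wide
// barrier. Kernels issued to the same stream execute in order, so every
// block of phase1 has finished before any block of phase2 starts.

__global__ void phase1(float *data) { /* work before the barrier */ }
__global__ void phase2(float *data) { /* work after the barrier  */ }

void run(float *d_data, int numBlocks, int blockSize)
{
    phase1<<<numBlocks, blockSize>>>(d_data);
    // No explicit synchronization needed between the two launches:
    // the in-order stream semantics provide the barrier.
    phase2<<<numBlocks, blockSize>>>(d_data);
    cudaDeviceSynchronize();  // wait on the host for phase2 to finish
}
```

This works for any number of blocks, at the cost of re-launching (and, if needed, passing intermediate state through global memory).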
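Method 2 can be sketched as below, assuming a hypothetical kernel and buffer; a cooperative launch additionally requires that the whole grid be simultaneously resident on the device, so the grid must be sized accordingly (e.g. via cudaOccupancyMaxActiveBlocksPerMultiprocessor):

```cuda
// Sketch of method 2: a cooperative launch with a grid-wide sync,
// including the cooperativeLaunch property query mentioned above.
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void gridSyncKernel(float *data)
{
    cg::grid_group grid = cg::this_grid();
    // ... phase 1 work ...
    grid.sync();  // all blocks in the grid wait here
    // ... phase 2 work ...
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.cooperativeLaunch) {
        printf("cooperative launch not supported on this device\n");
        return 1;
    }

    float *d_data = nullptr;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    // Cooperative kernels must be launched with
    // cudaLaunchCooperativeKernel, not the <<<...>>> syntax.
    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)gridSyncKernel,
                                dim3(prop.multiProcessorCount), dim3(256),
                                args);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Note that grid.sync() is only valid inside a cooperatively launched kernel; a grid that exceeds the device's resident capacity will fail to launch rather than deadlock.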