
I can't figure out the following.

If I launch a kernel and consider, for example, thread 0 in block 0: after a __syncthreads() call, will all the threads in all the other blocks see the changes made to global memory by thread 0 in block 0?

My guess is no. Indeed, in the Synchronization Functions section of the CUDA C Programming Guide, it is stated that:

void __syncthreads(); waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

However, when talking about global memory consistency in dynamic parallelism, the CUDA C Programming Guide states that:

Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

So does __syncthreads() also make the changes visible across blocks when dynamic parallelism is involved?

Thanks

Concerning your first question, your conclusion is correct, since CUDA does not provide a robust way to synchronize across blocks. There is a discussion on the NVIDIA forum entitled Synchronize all blocks in CUDA which may be of interest to you. – Vitality
Concerning memory consistency, and referring to the specific example you mention, I think the guide is simply saying that you need the first __syncthreads() to ensure that the global memory views of the parent and child kernels are consistent, since all device-side kernel launches are asynchronous (this is mentioned in the CUDA Dynamic Parallelism Programming Guide). The second __syncthreads() call is needed to ensure global memory consistency among the threads of the same block, since launched child kernels may take different amounts of time to finish. – Vitality
Take a look at __threadfence_system() in Memory Fence Functions for a way to make something written by one thread visible to the entire device. – Roger Dahl

1 Answer


The only action performed by __syncthreads() is the one you quoted, as described in the CUDA C Programming Guide. There is no way in CUDA to synchronize across blocks, apart from the naive approach of splitting the execution of a kernel into multiple kernel launches, with all its drawbacks in terms of performance. Accordingly, the answer to your first question, as you guessed, is no.
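To illustrate that naive approach: kernels issued to the same stream execute in order, so the boundary between two launches acts as a grid-wide synchronization point. A minimal sketch (the kernel names and the access pattern are illustrative, not from the original post):

```cuda
// Grid-wide "synchronization" via two separate kernel launches: every block
// of phase2 is guaranteed to see all global-memory writes made by phase1.
__global__ void phase1(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                                   // writes by all blocks
}

__global__ void phase2(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += data[(i + 1) % n];                  // reads across blocks
}

void launch(int *d_data, int nBlocks, int nThreads) {
    phase1<<<nBlocks, nThreads>>>(d_data);
    // No explicit synchronization is needed here: launches on the same
    // (default) stream are serialized, so phase2 starts only after phase1
    // has finished and its writes are visible device-wide.
    phase2<<<nBlocks, nThreads>>>(d_data, nBlocks * nThreads);
}
```

The cost is the kernel-launch overhead and the loss of any state kept in registers or shared memory between the two phases, which is why this is a last resort rather than a general tool.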

In the second part of your post, you are referring to a specific example from the CUDA C Programming Guide, namely:

__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x]+1;
}

__global__ void parent_launch(int *data) { 
    data[threadIdx.x] = threadIdx.x;

    __syncthreads();

    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }

    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}

Here, all 256 threads of the parent_launch kernel write something to data. After that, thread 0 invokes child_launch. The first __syncthreads() is needed to ensure that all those memory writes have completed before the child kernel invocation. Quoting the guide on this point:

Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).

Regarding the second __syncthreads(), the Guide explains that

When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

In that specific example, the second __syncthreads() is redundant, since there is an implicit synchronization at kernel termination; it becomes necessary, however, when further operations must be performed after the child kernel launch.
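To make that concrete, here is a hypothetical variant of the guide's example in which work follows the child launch (the out parameter and the final read pattern are my additions, not part of the guide):

```cuda
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data, int *out) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();                 // child must see all 256 writes

    if (threadIdx.x == 0) {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();     // thread 0 waits for the child grid
    }

    __syncthreads();                 // make the child's writes visible to
                                     // ALL parent threads, not just thread 0

    // Each thread reads an element written by a DIFFERENT child thread;
    // without the second __syncthreads() this read would be a race.
    out[threadIdx.x] = data[255 - threadIdx.x];
}
```

Here the second __syncthreads() is no longer redundant: only thread 0 synchronized with the child grid, so the barrier is what propagates the child's global-memory writes to the other 255 parent threads before they read data.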

Finally, concerning the sentence you are quoting in your post:

Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call

please note that in this specific example only one thread block is launched by the host_launch function. This may have misled you.

There is an interesting discussion (probably more than one) on the NVIDIA forum on thread synchronization across blocks, entitled

Synchronize all blocks in CUDA