The only action performed by __syncthreads() is the one you quoted from the CUDA C Programming Guide. There is no way in CUDA to synchronize across blocks, apart from the naive approach of dividing the execution of a kernel into multiple kernel launches, with all the drawbacks this implies in terms of performance. Accordingly, the answer to your first question is, as you also guessed, NO.
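For example, a minimal sketch of that kernel-splitting workaround (the kernel names phase1 and phase2, the host-side function host_side and the arrays d_in and d_out are illustrative names of mine, not taken from your code) could look like the following; kernels issued to the same stream execute in order, so no block of the second launch starts before every block of the first one has completed:

__global__ void phase1(int *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_in[i] = i;                                          // every block writes its own portion
}

__global__ void phase2(const int *d_in, int *d_out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i + blockDim.x) % (gridDim.x * blockDim.x);  // element written by a DIFFERENT block
    d_out[i] = d_in[i] + d_in[j];                         // safe: all of phase1 has already finished
}

void host_side(int *d_in, int *d_out) {
    phase1<<< 4, 256 >>>(d_in);
    // The boundary between the two launches acts as the grid-wide barrier:
    // no block of phase2 starts before every block of phase1 has completed.
    phase2<<< 4, 256 >>>(d_in, d_out);
}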
In the second part of your post, you are referring to a specific example of the CUDA C Programming Guide, namely
__global__ void child_launch(int *data) {
    // Each child thread increments the element written by the corresponding parent thread
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    // Every parent thread writes its own index into global memory
    data[threadIdx.x] = threadIdx.x;

    // Make all 256 writes visible before thread 0 launches the child grid
    __syncthreads();

    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();   // thread 0 waits for the child grid to finish
    }

    // Make the child's modifications visible to all threads of the parent block
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}
Here, all 256 threads of the parent_launch kernel write something into data. After that, thread 0 invokes child_launch. The first __syncthreads() is needed to ensure that all those memory writes have completed before the child kernel invocation. Quoting the Guide on this point:
Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).
Regarding the second __syncthreads(), the Guide explains that

When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.
In that specific example, the second __syncthreads() is redundant, since there is an implicit synchronization due to the kernel termination; it becomes necessary, however, as soon as other operations must be performed after the child kernel launch.
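As a purely illustrative variant of the Guide's example (the extra out parameter and the final computation are my own additions), the second __syncthreads() stops being redundant as soon as the parent threads consume the results produced by the child grid:

__global__ void parent_launch(int *data, int *out) {
    data[threadIdx.x] = threadIdx.x;

    __syncthreads();                     // make all 256 writes visible to the child

    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();         // only thread 0 waits for the child grid
    }

    __syncthreads();                     // without this barrier, threads other than 0
                                         // could read data[] before the child has run

    // Follow-up work: every thread consumes a value produced by the child grid
    out[threadIdx.x] = 2 * data[threadIdx.x];
}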
Finally, concerning the sentence you are quoting in your post:

Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call

please note that in this specific example only one thread block is launched by the host_launch function. This may have somewhat misled you.
There is an interesting discussion (probably even more than one) on the NVIDIA Forum about thread synchronization across blocks, entitled Synchronize all blocks in CUDA.
From the comments below this answer:

The first __syncthreads() is needed to ensure that the global memory seen by the parent and the child kernel is consistent, since all device-side kernel launches are asynchronous (this is mentioned in the CUDA Dynamic Parallelism Programming Guide). The second __syncthreads() call is needed to ensure global memory consistency between the kernels inside the same thread block, since launched child kernels may take different processing times to finish. – Vitality

See __threadfence_system() in Memory Fence Functions for a way to make something written by one thread visible on the entire device. – Roger Dahl
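As a minimal sketch of the pattern Roger Dahl's comment points to (the variables result and ready are assumptions of mine, not an established API), __threadfence_system() orders a data write before the write of a flag, so that any observer on the device, or on the host through mapped memory, that sees the flag set is also guaranteed to see the data:

__device__ int result;                   // payload written by the producing thread
__device__ volatile int ready = 0;       // flag published only after the fence

__global__ void producer(int value) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = value;                  // 1) write the data
        __threadfence_system();          // 2) the write above becomes visible to the whole
                                         //    device (and the host) before anything below
        ready = 1;                       // 3) publish: whoever later reads ready == 1 is
                                         //    guaranteed to also read the new result
    }
}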