CUDA doesn't provide any built-in functions to accomplish that directly.
I believe you would need to build something around polling; you can then check the results whenever you like. If you want something that blocks the CPU thread, spinning on the polling operation would do it. (cudaDeviceSynchronize() is itself, by default, usually a spin operation under the hood.)
You could build a polling system using various ideas:
cudaEvent - record an event (cudaEventRecord()) into the stream after each kernel launch, then poll with cudaEventQuery()
cudaHostAlloc - allocate host-pinned memory that each kernel updates with its status; the host polls by reading that memory directly
cudaLaunchHostFunc - put a callback in place after each kernel launch. The callback host function would update ordinary host memory, which you could poll for status.
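
As a rough illustration of the event-based option, here is a minimal sketch. It assumes each kernel runs in its own stream; busyKernel, the stream count N, and the cycle counts are hypothetical placeholders for real work, not anything from the question. Error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: spins for roughly `cycles` clock ticks.
__global__ void busyKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    const int N = 3;                 // number of independent kernels (assumed)
    cudaStream_t stream[N];
    cudaEvent_t  done[N];
    bool         finished[N] = {false};

    for (int i = 0; i < N; ++i) {
        cudaStreamCreate(&stream[i]);
        // Timing data is not needed for polling, so disable it.
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
        busyKernel<<<1, 1, 0, stream[i]>>>(100000000LL * (N - i));
        // The event completes once all prior work in stream[i] has completed.
        cudaEventRecord(done[i], stream[i]);
    }

    int remaining = N;
    while (remaining > 0) {          // spinning on the poll blocks this CPU thread
        for (int i = 0; i < N; ++i) {
            if (finished[i]) continue;
            cudaError_t st = cudaEventQuery(done[i]);
            if (st == cudaSuccess) {
                finished[i] = true;
                --remaining;
                printf("kernel %d finished\n", i);
            } else if (st != cudaErrorNotReady) {
                // Anything other than "not ready" is a real error.
                fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(st));
                return 1;
            }
        }
    }

    for (int i = 0; i < N; ++i) {
        cudaEventDestroy(done[i]);
        cudaStreamDestroy(stream[i]);
    }
    return 0;
}
```

The loop here spins until every event has completed. To poll without blocking, call cudaEventQuery() once per event at whatever point suits the application and carry on if it returns cudaErrorNotReady.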
The callback method (at least) would allow you (perhaps via atomics) to collapse the polling to a single memory location, if that were important for some reason. You could probably implement something similar with the host-pinned memory method on systems that support CUDA system-scope atomics (e.g. atomicAdd_system()).
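
A minimal sketch of the callback approach collapsed to a single location might look like the following. It assumes one stream per kernel and uses a std::atomic<int> in ordinary host memory as the shared counter; busyKernel, markDone, and N are hypothetical names chosen for illustration.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: spins for roughly `cycles` clock ticks.
__global__ void busyKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Host function executed after the preceding work in its stream has completed.
// It must not call CUDA API functions; it only bumps the shared counter.
void CUDART_CB markDone(void* userData)
{
    auto* counter = static_cast<std::atomic<int>*>(userData);
    counter->fetch_add(1, std::memory_order_release);
}

int main()
{
    const int N = 3;                      // number of independent kernels (assumed)
    std::atomic<int> completed{0};        // the single location that gets polled
    cudaStream_t stream[N];

    for (int i = 0; i < N; ++i) {
        cudaStreamCreate(&stream[i]);
        busyKernel<<<1, 1, 0, stream[i]>>>(100000000LL * (N - i));
        // Enqueue the host function right behind the kernel in the same stream.
        cudaLaunchHostFunc(stream[i], markDone, &completed);
    }

    int seen = 0;
    while (seen < N) {                    // spin on one memory location
        int now = completed.load(std::memory_order_acquire);
        if (now != seen) {
            printf("%d of %d kernels finished\n", now, N);
            seen = now;
        }
    }

    for (int i = 0; i < N; ++i) {
        cudaStreamDestroy(stream[i]);
    }
    return 0;
}
```

A counter only reports how many kernels have finished. If you need to know which ones, one option is to have each callback set its own bit in an atomic bitmask (fetch_or) instead, which still keeps the polling on a single location.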