Wait until any device has finished in CUDA?

Question

I have a CUDA kernel that I want to run across multiple GPUs. On each GPU, it's performing a search task, so I'd like to launch it on each GPU and then in the host code wait until any of the GPUs returns (indicating that it found what it was looking for).

I know about cudaDeviceSynchronize(), but that blocks until the current GPU is finished. Is there anything that will let me block until any one of N different GPUs finishes?

Robert Crovella Robert Crovella · Accepted Answer · 2020-11-24T01:56:44

CUDA doesn't provide any built-in functions to accomplish that directly.

I believe you would need to do something via polling, and then if you want to poll the results, you can. If you want to build something that blocks the CPU thread, I guess a spin on the polling operation would do it. (cudaDeviceSynchronize() is by default a spin operation under the hood)

You could build a polling system using various ideas:

cudaEvent - launch an event after each kernel launch, then use cudaEventQuery() operations to poll
cudaHostAlloc - use host-pinned memory that each kernel can update with status - read the memory directly
cudaLaunchHostFunc - put a callback in place after each kernel launch. The callback host function would update ordinary host memory, which you could poll for status.

The callback method (at least) would allow you (perhaps via atomics) to collapse the polling to a single memory location, if that were important for some reason. You could probably implement something similar using the host-pinned memory method for systems that have CUDA system atomic support.

Wait until *any* device has finished in CUDA?

1 Answers

Wait until any device has finished in CUDA?