1 vote

I’m using CUDA streams to enable asynchronous data transfers and hide memory copy latency. I have 2 CPU threads and 2 CUDA streams: one is a “data” stream, which is essentially a sequence of cudaMemcpyAsync calls initiated by the first CPU thread, and the other is a “compute” stream, which executes compute kernels. The data stream prepares batches for the compute stream, so it is critical that the compute stream not start on a batch until that batch is completely loaded into memory.

Should I use CUDA events for such synchronization or some other mechanism?

Update: let me clarify why I cannot use separate streams that each do their own copies and computation. The problem is that the batches must be processed in order; that is, I cannot execute them in parallel (which, of course, would have been possible with multiple streams). However, while processing each batch I can pre-load the data for the next batch, thus hiding the data transfers. To use Robert’s example:

cudaMemcpyAsync( <data for batch1>, dataStream); // pre-load batch 1
cudaMemcpyAsync( <data for batch2>, dataStream); // pre-load batch 2 while batch 1 computes
kernelForBatch1<<<..., opsStream>>>(...);        // must not start before batch 1's copy finishes
kernelForBatch2<<<..., opsStream>>>(...);        // must not start before batch 2's copy finishes
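
For this ordered-batch pattern, events are a natural fit: record an event in the data stream after each batch’s copy, and have the compute stream wait on it before launching that batch’s kernel. Below is a minimal, hedged sketch; the batch count, buffer names, sizes, and the kernel body are illustrative placeholders, not from the original post:

#include <cuda_runtime.h>

#define NUM_BATCHES 4
#define BATCH_ELEMS (1 << 20)

// Stand-in for the real per-batch work (placeholder).
__global__ void kernelForBatch(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    cudaStream_t dataStream, opsStream;
    cudaStreamCreate(&dataStream);
    cudaStreamCreate(&opsStream);

    float *h_buf[NUM_BATCHES], *d_buf[NUM_BATCHES];
    cudaEvent_t copyDone[NUM_BATCHES];
    for (int i = 0; i < NUM_BATCHES; ++i) {
        // Pinned host memory so the copies are truly asynchronous.
        cudaHostAlloc(&h_buf[i], BATCH_ELEMS * sizeof(float), cudaHostAllocDefault);
        cudaMalloc(&d_buf[i], BATCH_ELEMS * sizeof(float));
        cudaEventCreateWithFlags(&copyDone[i], cudaEventDisableTiming);
    }

    for (int i = 0; i < NUM_BATCHES; ++i) {
        // Queue batch i's copy in the data stream; returns immediately.
        cudaMemcpyAsync(d_buf[i], h_buf[i], BATCH_ELEMS * sizeof(float),
                        cudaMemcpyHostToDevice, dataStream);
        // Mark the point in dataStream where batch i is fully on the device.
        cudaEventRecord(copyDone[i], dataStream);
        // opsStream stalls (on the device, not the host) until that copy is done.
        cudaStreamWaitEvent(opsStream, copyDone[i], 0);
        // All kernels share opsStream, so batches still execute strictly in
        // order, while batch i+1's copy can overlap batch i's kernel.
        kernelForBatch<<<(BATCH_ELEMS + 255) / 256, 256, 0, opsStream>>>(d_buf[i], BATCH_ELEMS);
    }
    cudaDeviceSynchronize();
    return 0;
}

Because all of the kernels share opsStream, the in-order requirement is preserved; the two streams only overlap the copy for the next batch with the kernel for the current one.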

1 Answer

4 votes

You can certainly use CUDA events to synchronize streams, for example with the cudaStreamWaitEvent API function. However, the idea of putting all data copies in one stream and all kernel calls in another may not be a sensible use of streams.

CUDA operations (API calls, kernel calls) issued to a single stream are guaranteed to execute in order: no operation in that stream begins until all previous CUDA activity in that stream has completed (even if you are using calls such as cudaMemcpyAsync...).

So streams already give you a mechanism to ensure that a kernel call will not begin until the necessary data has been copied for it. Just put that kernel call in the same stream, after the data copy.

Something like this should take care of your synchronization:

cudaMemcpyAsync( <data for kernel1>, stream1);   // H2D copy for kernel1
cudaMemcpyAsync( <data for kernel2>, stream2);   // H2D copy for kernel2
kernel1<<<..., stream1>>>(...);                  // begins only after stream1's copy
kernel2<<<..., stream2>>>(...);                  // begins only after stream2's copy
cudaMemcpyAsync( <data from kernel1>, stream1);  // D2H copy of kernel1's results
cudaMemcpyAsync( <data from kernel2>, stream2);  // D2H copy of kernel2's results

All of the above calls are asynchronous, so assuming you've met the other requirements for asynchronous execution (such as using pinned host memory), all of them should queue up and return to the host immediately. However, kernel1 is guaranteed not to begin before the preceding cudaMemcpyAsync issued to stream1 has completed, and likewise for kernel2 and the data transfer in stream2.
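
As an aside on that pinned-memory requirement: the host buffers passed to cudaMemcpyAsync must be page-locked for the transfer to be truly asynchronous; with ordinary pageable memory the call may fall back to blocking behavior and the overlap is lost. A small illustrative sketch (buffer name and size are assumptions):

float *h_data;
size_t bytes = 1 << 20;  // illustrative size
// Pinned (page-locked) allocation; plain malloc'd memory would not
// reliably overlap copies with kernel execution.
cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);  // or cudaMallocHost
// ... fill h_data, then use it with cudaMemcpyAsync(..., stream1) ...
cudaFreeHost(h_data);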

I don't see any reason to break the above activity into separate CPU threads either. That unnecessarily complicates things. The most trouble-free way to manage a single device is from a single CPU thread.