I’m using CUDA streams to enable asynchronous data transfers and hide memory copy latency. I have two CPU threads and two CUDA streams: a “data” stream, which is essentially a sequence of cudaMemcpyAsync calls issued by the first CPU thread, and a “compute” stream, which executes the compute kernels. The data stream prepares batches for the compute stream, so it is critical that the compute stream only start working on a batch once that batch has been completely loaded into memory.
Should I use CUDA events for this kind of synchronization, or some other mechanism?
Update: let me clarify why I cannot use a separate stream for each batch’s copy/compute pair. The problem is that the batches must be processed in order; that is, I cannot execute them in parallel (which, of course, would have been possible with multiple streams). However, while processing one batch I can pre-load the data for the next batch, thus hiding the transfers. To use Robert’s example:
cudaMemcpyAsync( <data for batch1>, dataStream);
cudaMemcpyAsync( <data for batch2>, dataStream);
kernelForBatch1<<<..., opsStream>>>(...);
kernelForBatch2<<<..., opsStream>>>(...);
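A minimal sketch of the event-based synchronization I have in mind, extending the example above: record an event on dataStream after each batch’s copy, and have opsStream wait on that event before launching the corresponding kernel. The kernel names, stream variables, and buffer/size names (d_batch1, h_batch1, batchBytes, grid, block) are placeholders; only the cudaEvent*/cudaStreamWaitEvent calls are the actual CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Placeholders: dataStream/opsStream were created earlier with
// cudaStreamCreate; d_batchN/h_batchN/batchBytes/grid/block are
// hypothetical names for this sketch.
cudaEvent_t batch1Ready, batch2Ready;
cudaEventCreateWithFlags(&batch1Ready, cudaEventDisableTiming);
cudaEventCreateWithFlags(&batch2Ready, cudaEventDisableTiming);

// Data thread: queue each copy, then record an event marking the
// point in dataStream at which that batch is fully resident.
cudaMemcpyAsync(d_batch1, h_batch1, batchBytes,
                cudaMemcpyHostToDevice, dataStream);
cudaEventRecord(batch1Ready, dataStream);
cudaMemcpyAsync(d_batch2, h_batch2, batchBytes,
                cudaMemcpyHostToDevice, dataStream);
cudaEventRecord(batch2Ready, dataStream);

// Compute thread: make opsStream wait (on the device, without
// blocking the host) until the matching copy has finished. Kernels
// still execute in order because they share opsStream.
cudaStreamWaitEvent(opsStream, batch1Ready, 0);
kernelForBatch1<<<grid, block, 0, opsStream>>>(d_batch1);
cudaStreamWaitEvent(opsStream, batch2Ready, 0);
kernelForBatch2<<<grid, block, 0, opsStream>>>(d_batch2);
```

Note that for cudaMemcpyAsync to actually overlap with kernel execution, the host buffers need to be pinned (allocated with cudaMallocHost or cudaHostAlloc); with pageable memory the copies fall back to being effectively synchronous with respect to the stream.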