
My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Another thread will consume the results, which will be stored in a queue. Some pseudo-code here:

while (true) {
  cudaMemcpyAsync(d_mem, h_mem, size, cudaMemcpyHostToDevice, some_stream);
  kernel<<<grid, block, 0, some_stream>>>(d_mem, d_result);
  cudaMemcpyAsync(h_queue_results[i++], d_result, size, cudaMemcpyDeviceToHost, some_stream);
}

Is it safe to reuse h_mem after the first cudaMemcpyAsync returns, or should I use N host buffers for issuing the GPU computation?

How can I know when h_mem can be reused? Should I do some synchronization using cudaEvents?

BTW, h_mem is host-pinned. If it were pageable, could I reuse h_mem immediately? From what I have read below, it seems I could reuse it immediately after cudaMemcpyAsync returns; am I right?

Asynchronous

For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

MemcpyAsynchronousBehavior

Thanks!


1 Answer


In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
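For reference, here is a minimal sketch of allocating pinned host memory (the names and sizes are illustrative, not taken from your code):

float *h_mem, *d_mem;
size_t bytes = N * sizeof(float);   // hypothetical element count N

// Pinned (page-locked) host allocation: required for cudaMemcpyAsync
// to be truly asynchronous and for copy/compute overlap.
cudaMallocHost(&h_mem, bytes);
cudaMalloc(&d_mem, bytes);

// ... use the buffers ...

cudaFreeHost(h_mem);   // pinned memory is freed with cudaFreeHost, not free()
cudaFree(d_mem);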

Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next CUDA operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once the kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
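If the goal is to keep the host producing input without ever blocking on the GPU, one common pattern (a sketch under assumed names and an assumed pool depth, not something from your code) is to rotate through a small pool of pinned buffers and wait on an event only when a buffer comes back around for reuse:

#define NBUF 4                      // assumed pool depth; tune to your pipeline

float      *h_bufs[NBUF];           // pinned host input buffers
cudaEvent_t reusable[NBUF];         // fires when a buffer's H2D copy has finished

for (int b = 0; b < NBUF; ++b) {
    cudaMallocHost(&h_bufs[b], bytes);
    cudaEventCreate(&reusable[b]);
    cudaEventRecord(reusable[b], some_stream);  // mark as reusable from the start
}

for (int i = 0; keep_running(); ++i) {          // keep_running() is hypothetical
    int b = i % NBUF;
    cudaEventSynchronize(reusable[b]);          // wait until this buffer's previous copy is done
    fill_input(h_bufs[b]);                      // hypothetical: write the next input
    cudaMemcpyAsync(d_mem, h_bufs[b], bytes, cudaMemcpyHostToDevice, some_stream);
    cudaEventRecord(reusable[b], some_stream);  // will fire once h_bufs[b] has been consumed
    kernel<<<grid, block, 0, some_stream>>>(d_mem, d_result);
    cudaMemcpyAsync(h_queue_results[i], d_result, bytes, cudaMemcpyDeviceToHost, some_stream);
}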

You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
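For example, a minimal sketch of the event approach (names are assumed), recording an event immediately after the cudaMemcpyAsync and synchronizing on it before touching h_mem again:

cudaEvent_t h2d_done;
cudaEventCreate(&h2d_done);

cudaMemcpyAsync(d_mem, h_mem, bytes, cudaMemcpyHostToDevice, some_stream);
cudaEventRecord(h2d_done, some_stream);  // completes once the copy (and prior stream work) finishes
kernel<<<grid, block, 0, some_stream>>>(d_mem, d_result);

// ... later, before overwriting h_mem:
cudaEventSynchronize(h2d_done);          // blocks the host until the H2D copy is complete
// h_mem may now be safely reused

Alternatively, calling cudaStreamSynchronize(some_stream) after the cudaMemcpyAsync gives the same guarantee, at the cost of also waiting for everything else already queued in that stream.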