My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Another thread will consume the results, which will be stored in a queue. Some pseudo-code here:
while (true) {
    cudaMemcpyAsync(d_mem, h_mem, size, cudaMemcpyHostToDevice, some_stream);
    kernel_launch<<<grid, block, 0, some_stream>>>(d_mem, d_result);
    cudaMemcpyAsync(h_queue_results[i++], d_result, size, cudaMemcpyDeviceToHost, some_stream);
}
Is it safe to reuse h_mem after the first cudaMemcpyAsync returns, or should I use N host buffers for issuing the GPU computation?
How can I know when h_mem can be reused? Should I add some synchronization using cudaEvents?
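To make the event idea concrete, here is a rough sketch of what I have in mind (helper names like `fill_next_input` are made up; only the event calls matter). The event is recorded right after the H2D copy on the stream, and synchronized before refilling h_mem:

```cpp
// Record an event at the point where the stream has consumed h_mem,
// then wait on it before overwriting h_mem with the next input.
cudaEvent_t h2d_done;
cudaEventCreateWithFlags(&h2d_done, cudaEventDisableTiming);

while (keep_running) {
    cudaMemcpyAsync(d_mem, h_mem, size, cudaMemcpyHostToDevice, some_stream);
    cudaEventRecord(h2d_done, some_stream);   // H2D copy is ordered before this event

    kernel_launch<<<grid, block, 0, some_stream>>>(d_mem, d_result);
    cudaMemcpyAsync(h_queue_results[i++], d_result, size,
                    cudaMemcpyDeviceToHost, some_stream);

    cudaEventSynchronize(h2d_done);           // after this returns, h_mem is free to reuse
    fill_next_input(h_mem);                   // hypothetical producer for the next iteration
}
```

With a single buffer this serializes the host against the copy, which is why I'm also asking about N buffers.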
BTW, h_mem is host-pinned. If it were pageable, could I reuse h_mem immediately? From what I have read here it seems I could reuse it immediately after cudaMemcpyAsync returns, am I right?
Asynchronous
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
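For reference, the N-host-buffers alternative I'm considering would look roughly like this (the buffer count and `fill_input` are made up). Each pinned buffer gets its own event, so the CPU can fill buffer i+1 while the GPU is still reading buffer i:

```cpp
// Round-robin over N pinned host buffers, guarded by one event per buffer.
const int N = 4;                            // assumed buffer count
void* h_bufs[N];
cudaEvent_t in_use[N];
for (int i = 0; i < N; ++i) {
    cudaMallocHost(&h_bufs[i], size);       // pinned allocation
    cudaEventCreateWithFlags(&in_use[i], cudaEventDisableTiming);
}

for (int iter = 0; keep_running; ++iter) {
    int b = iter % N;
    cudaEventSynchronize(in_use[b]);        // wait until this buffer's last H2D copy finished
    fill_input(h_bufs[b]);                  // hypothetical producer
    cudaMemcpyAsync(d_mem, h_bufs[b], size, cudaMemcpyHostToDevice, some_stream);
    cudaEventRecord(in_use[b], some_stream);
}
```

(Synchronizing on an event that was created but never recorded completes immediately, so the first N iterations don't block.)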
Thanks!