1 vote

I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.

I have code that does something like the following:

cudaEvent_t event;
cudaEventCreate(&event);
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
// time-point B
cudaEventSynchronize(event);
// time-point C

I'm using sys/time.h to profile (code omitted for clarity). I find that the cublasSetVectorAsync call dominates the time, as though it were behaving synchronously; i.e., the duration A-B is much longer than the duration B-C and increases as I increase the size of the transfer.
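Roughly, the timing looks like this (a sketch rather than my exact code; elapsed_ms is just an illustrative helper, and n, x, y, etc. are the same variables as above):

#include <stdio.h>
#include <sys/time.h>

// Illustrative helper: milliseconds elapsed between two timevals.
static double elapsed_ms(struct timeval a, struct timeval b) {
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

struct timeval tA, tB, tC;
gettimeofday(&tA, NULL);                                   // time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);    // "async" host-to-device copy
cudaEventRecord(event);
gettimeofday(&tB, NULL);                                   // time-point B
cudaEventSynchronize(event);
gettimeofday(&tC, NULL);                                   // time-point C
printf("A-B: %.3f ms   B-C: %.3f ms\n", elapsed_ms(tA, tB), elapsed_ms(tB, tC));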

What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?

I'm using a GeForce GTX 285 with Cuda compilation tools, release 4.1, V0.2.1221

Are you sure that passing in a null stream doesn't make the operation synchronous? - Joachim Isaksson
@JoachimIsaksson I checked this out and no, 0 is the default stream. It should still be asynchronous. Section 3.2.5.5.2 of the CUDA C Programming guide. - MRocklin
Is this your real code? You don't have anything here for cublasSetVectorAsync to overlap. All you do is call an async function and then synchronize it (cudaEventSynchronize). That's not even mentioning the fact that you are doing everything in the default stream, within which everything is synchronous except for kernel launches (relative to the calling host thread). There are no kernel launches in the code you posted. - harrism
@harrism My real code is more complex than this. This is the minimum example I could think of that clearly demonstrates the problem. I'm only interested in asynchronicity between Host and Device so I don't think that streams are an issue. I agree that this code does not test concurrency within the GPU. - MRocklin
My bad. You are of course correct. Let me try to be more helpful... I will ask the CUBLAS team if this is correct, but my suggestion is that since cudaMemcpyAsync requires pinned host memory, you need to use cudaHostAlloc to allocate the host memory that is input to cublasSetVectorAsync. Otherwise it will have to copy to a pinned region itself (a host memcpy) before copying to the device, and this would explain the behavior you are seeing. - harrism

2 Answers

3 votes

cublasSetVectorAsync is a thin wrapper around cudaMemcpyAsync. Unfortunately, in some circumstances, the name of this function is a misnomer, as explained in the CUDA reference manual.

Notably:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

And

For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.

So the solution to your problem is likely to just allocate x, your host data array, using cudaHostAlloc, rather than standard malloc (or C++ new).
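Something along these lines (a sketch only, with error checking omitted; the buffer names and float element type are placeholders):

float *x;          // host source vector (pinned)
float *y;          // device destination vector
cudaHostAlloc((void**)&x, n * sizeof(float), cudaHostAllocDefault);  // pinned instead of malloc/new
cudaMalloc((void**)&y, n * sizeof(float));

// ... fill x on the host ...

// With pinned host memory the copy can be queued as a true asynchronous DMA:
cublasSetVectorAsync(n, sizeof(float), x, 1, y, 1, 0);

// ... later ...
cudaFreeHost(x);
cudaFree(y);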

Alternatively, if your GPU and CUDA version support it, you can use malloc and then call cudaHostRegister on the malloc-ed pointer. Note in the documentation the condition that you must create your CUDA context with the cudaDeviceMapHost flag in order for cudaHostRegister to have any effect (see the documentation for cudaSetDeviceFlags).
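Roughly (again just a sketch, error checking omitted):

// Must be called before the CUDA context is created (i.e. before any other CUDA call):
cudaSetDeviceFlags(cudaDeviceMapHost);

float *x = (float*)malloc(n * sizeof(float));
cudaHostRegister(x, n * sizeof(float), cudaHostRegisterDefault);  // pin the existing allocation

// ... use x with cublasSetVectorAsync as before ...

cudaHostUnregister(x);
free(x);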

-1 votes

In cuBLAS/cuSPARSE, things take place in stream 0 if you don't specify a different stream. To specify a stream, you have to use cublasSetStream (see cuBLAS documentation).
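For example (a sketch using the cublas_v2 API; note that the copy helpers such as cublasSetVectorAsync take their stream argument directly rather than through the handle):

#include <cublas_v2.h>

cublasHandle_t handle;
cudaStream_t stream;
cublasCreate(&handle);
cudaStreamCreate(&stream);

cublasSetStream(handle, stream);   // subsequent cuBLAS kernels launch into 'stream'
// ... cuBLAS calls ...
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, stream);  // helpers take the stream explicitly

cudaStreamDestroy(stream);
cublasDestroy(handle);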