I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.
I have code that does something like the following
cudaEvent_t event;
cudaEventCreate(&event);
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
// time-point B
cudaEventSynchronize(event);
// time-point C
I'm using sys/time.h to profile (code omited for clarity). I find that the cublasSetVectorAsync call dominates the time as though it were behaving synchronously. I.e. the duration A-B is much longer than the duration B-C and increases as I increase the size of the transfer.
What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?
I'm using a GeForce GTX 285 with Cuda compilation tools, release 4.1, V0.2.1221
cudaEventSynchronize). That's not even mentioning the fact that you are doing everything in the default stream, within which everything is synchronous except for kernel launches (relative to calling host thread). There are no kernel launches in the code you posted. - harrism