I have noticed that I can call cuBLAS functions on matrices whose memory was allocated with either cudaMalloc() or cublasAlloc(). Both the matrix transfer rates and the computation are slower for arrays allocated with cudaMalloc() than for those allocated with cublasAlloc(), although there are other advantages to arrays allocated with cudaMalloc(). Why is that the case? It would be great to hear some comments.
cublasAlloc() takes an elemSize argument, but cudaMalloc() doesn't). – Gabriel
pitch mallocs for cublas (you'll need to use the lda and ldb terms in BLAS appropriately). It may give a significant speedup. And of course there's pinned memory too. – Gabriel
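To make the lda/ldb point concrete, here is a minimal host-only sketch of the arithmetic involved when a column-major matrix is stored with padded columns. The helper names (padded_pitch, leading_dim, col_major_index) and the 256-byte alignment used in the example are assumptions for illustration only; on a real device, cudaMallocPitch() chooses and returns the pitch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round one column's byte width up to `alignment` bytes.
   This only mimics what a pitched allocation does; the real pitch would be
   returned by cudaMallocPitch() and may differ from this value. */
size_t padded_pitch(size_t rows, size_t elem_size, size_t alignment)
{
    assert(rows > 0 && elem_size > 0 && alignment > 0);
    size_t width = rows * elem_size;              /* unpadded column width in bytes */
    return (width + alignment - 1) / alignment * alignment;
}

/* Leading dimension in elements: the value to pass as lda or ldb. */
size_t leading_dim(size_t pitch_bytes, size_t elem_size)
{
    assert(elem_size > 0 && pitch_bytes % elem_size == 0);
    return pitch_bytes / elem_size;
}

/* Linear index of element (row, col) in column-major storage
   with leading dimension ld (ld >= number of rows). */
size_t col_major_index(size_t row, size_t col, size_t ld)
{
    return row + col * ld;
}
```

For a matrix with 100 rows of 4-byte floats and the assumed 256-byte alignment, the pitch comes out as 512 bytes, so the leading dimension passed to BLAS would be 128 rather than 100.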