I have noticed that I can allocate memory blocks for matrices with either cudaMalloc() or cublasAlloc() and then call CUBLAS functions on them. The matrix transfer rates and computational throughput are slower for arrays allocated with cudaMalloc() than with cublasAlloc(), although there are other advantages to using cudaMalloc(). Why is that the case? It would be great to hear some comments.
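To make the comparison concrete, here is a minimal sketch of how the two allocation/transfer paths could be timed against each other (the matrix size, variable names, and use of CUDA events are my assumptions, not from the question; error handling omitted):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>   /* legacy CUBLAS API, matching cublasAlloc() */

#define N 1024        /* square matrix dimension (illustrative) */

int main(void)
{
    float *h_A = (float *)malloc(N * N * sizeof(float));
    float *d_cuda, *d_cublas;
    cudaEvent_t start, stop;
    float ms;

    cublasInit();
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Path 1: cudaMalloc() + cudaMemcpy() */
    cudaMalloc((void **)&d_cuda, N * N * sizeof(float));
    cudaEventRecord(start, 0);
    cudaMemcpy(d_cuda, h_A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc path:  %.3f ms\n", ms);

    /* Path 2: cublasAlloc() + cublasSetMatrix() */
    cublasAlloc(N * N, sizeof(float), (void **)&d_cublas);
    cudaEventRecord(start, 0);
    cublasSetMatrix(N, N, sizeof(float), h_A, N, d_cublas, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cublasAlloc path: %.3f ms\n", ms);

    cudaFree(d_cuda);
    cublasFree(d_cublas);
    cublasShutdown();
    free(h_A);
    return 0;
}
```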

Do you see this when working with floats, doubles, or both? It could be an alignment issue (cudaAlloc() takes an elemSize argument, but cudaMalloc() doesn't). – Gabriel
I was working with floats in both cases. I haven't seen this with doubles, as I don't need to work with doubles in my application. I'll check with cudaAlloc at the same time. – stanigator
By the way, Gabriel, do you mean cublasAlloc() rather than cudaAlloc()? – stanigator
Yeah, that was a typo. If you're seeing this with floats, then I don't know what to say about the performance difference. – Gabriel
If you're just looking to optimize, check out the 2D aligned pitch mallocs for CUBLAS (you'll need to use the lda and ldb terms in BLAS appropriately). It may give a significant speedup. And of course there's pinned memory too. – Gabriel
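Gabriel's pitched-allocation suggestion might look like the sketch below: each column of a column-major matrix is padded to an aligned pitch, and the pitch (in elements) is passed as the leading dimension. The function name and sizes are illustrative; this uses the legacy CUBLAS API to match cublasAlloc():

```cuda
#include <cuda_runtime.h>
#include <cublas.h>

/* Multiply two m x m column-major matrices whose columns are padded
   to an aligned pitch; the pitch (in elements) becomes lda/ldb/ldc. */
void pitched_sgemm(int m)
{
    float *dA, *dB, *dC;
    size_t pitchA, pitchB, pitchC;

    /* Column-major: each column is m floats wide (the "width"),
       and there are m columns (the "height"). */
    cudaMallocPitch((void **)&dA, &pitchA, m * sizeof(float), m);
    cudaMallocPitch((void **)&dB, &pitchB, m * sizeof(float), m);
    cudaMallocPitch((void **)&dC, &pitchC, m * sizeof(float), m);

    int lda = (int)(pitchA / sizeof(float));
    int ldb = (int)(pitchB / sizeof(float));
    int ldc = (int)(pitchC / sizeof(float));

    /* ... fill dA and dB here, e.g. with
       cublasSetMatrix(m, m, sizeof(float), hA, m, dA, lda) ... */

    /* C = A * B; the leading dimensions tell CUBLAS to skip the
       padding at the end of each column. */
    cublasSgemm('N', 'N', m, m, m, 1.0f, dA, lda, dB, ldb, 0.0f, dC, ldc);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

For the pinned-memory point, allocating the host buffer with cudaMallocHost() instead of malloc() speeds up the host-to-device copies themselves.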

1 Answer

cublasAlloc() is essentially a wrapper around cudaMalloc(), so there should be no difference. Is there anything else that changes in your code?
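To illustrate the equivalence: the two calls below request the same device allocation, with cublasAlloc() taking an element count and element size instead of a single byte count (a sketch; error handling omitted):

```cuda
#include <cuda_runtime.h>
#include <cublas.h>   /* legacy CUBLAS API */

int main(void)
{
    float *a, *b;
    int n = 1024 * 1024;

    cublasInit();

    /* Same request, two spellings: n * sizeof(float) bytes on the
       device from the same underlying allocator. */
    cudaMalloc((void **)&a, n * sizeof(float));
    cublasAlloc(n, sizeof(float), (void **)&b);

    cudaFree(a);
    cublasFree(b);
    cublasShutdown();
    return 0;
}
```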