I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context: each matrix is (2048 x 8) and each vector has length 2048. The outputs are all independent. Because the matrices and vectors are packed into "super" arrays, I have the following layout:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
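For concreteness, here is a minimal sketch of how I set up those packed buffers (the variable names `d_matrix`, `d_vector`, and `out` match the call below; the exact allocation code is just illustrative, the dimensions are the ones given above):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Illustrative setup of the packed ("super") buffers described above.
const int Adim  = 2048;   // rows of each mini matrix / length of each input vector
const int Bdim  = 8;      // columns of each mini matrix / length of each output
const int batch = 128;    // number of independent mini problems

cublasHandle_t handle;
cublasCreate(&handle);

float *d_matrix, *d_vector, *out;
cudaMalloc(&d_matrix, sizeof(float) * (size_t)Adim * batch * Bdim);  // [(2048*128) x 8]
cudaMalloc(&d_vector, sizeof(float) * (size_t)Adim * batch);         // [(2048*128) x 1]
cudaMalloc(&out,      sizeof(float) * (size_t)Bdim * batch);         // [(8*128)    x 1]
```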
With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result in memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls which I would like to do in one.
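Continuing from the setup sketch above, this is roughly the loop I'm running; the offset arithmetic follows from the packed layout described earlier, and the alpha/beta pointers are per the cuBLAS v2 API (a sketch, not my exact code):

```c
// Current approach: one cublasSgemv launch per mini matrix, 128 launches total.
float scale1 = 1.0f, scale2 = 1.0f;   // beta = 1 so results accumulate into out
for (int i = 0; i < batch; ++i) {
    size_t offset1 = (size_t)i * Adim * Bdim;  // start of the i-th mini matrix
    size_t offset2 = (size_t)i * Adim;         // start of the i-th input vector
    size_t offset3 = (size_t)i * Bdim;         // start of the i-th output slot
    cublasSgemv(handle, CUBLAS_OP_T, Bdim, Adim, &scale1,
                d_matrix + offset1, Bdim,
                d_vector + offset2, 1,
                &scale2,
                out + offset3, 1);
}
```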
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go, or is there some way to make a single call with the relevant offsets (to index into my arrays of matrices and vectors)? The only other efficient option I could see was a cuSPARSE call with all the matrices placed on the block diagonal of one large sparse matrix.
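To make the streams option concrete, this is roughly what I imagine it would look like, reusing the loop above (cublasSetStream is the real cuBLAS call for this; the pool size and round-robin assignment are just assumptions, not something I have profiled):

```c
// Streams idea: issue each gemv on its own stream so launches can overlap.
const int nstreams = 16;                 // assumed pool size, not tuned
cudaStream_t streams[nstreams];
for (int s = 0; s < nstreams; ++s)
    cudaStreamCreate(&streams[s]);

for (int i = 0; i < batch; ++i) {
    cublasSetStream(handle, streams[i % nstreams]);  // round-robin over the pool
    cublasSgemv(handle, CUBLAS_OP_T, Bdim, Adim, &scale1,
                d_matrix + (size_t)i * Adim * Bdim, Bdim,
                d_vector + (size_t)i * Adim, 1,
                &scale2,
                out + (size_t)i * Bdim, 1);
}
cudaDeviceSynchronize();
for (int s = 0; s < nstreams; ++s)
    cudaStreamDestroy(streams[s]);
```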
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.