I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context: each matrix is (2048 x 8) and each vector has length 2048. The outputs are all independent. Because the matrices and vectors are packed into "super" arrays, I have the following layout:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
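For concreteness, here is a minimal sketch of how I set up those packed buffers (the variable names `d_matrix`, `d_vector`, and `out` match the call below; the exact allocation code is just illustrative, the dimensions are the ones given above):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Illustrative setup of the packed ("super") buffers described above.
const int Adim  = 2048;   // rows of each mini matrix / length of each input vector
const int Bdim  = 8;      // columns of each mini matrix / length of each output
const int batch = 128;    // number of independent mini problems

cublasHandle_t handle;
cublasCreate(&handle);

float *d_matrix, *d_vector, *out;
cudaMalloc(&d_matrix, sizeof(float) * (size_t)Adim * batch * Bdim);  // [(2048*128) x 8]
cudaMalloc(&d_vector, sizeof(float) * (size_t)Adim * batch);         // [(2048*128) x 1]
cudaMalloc(&out,      sizeof(float) * (size_t)Bdim * batch);         // [(8*128)    x 1]
```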
With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result in memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls which I would like to do in one.
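Continuing from the setup sketch above, this is roughly the loop I'm running; the offset arithmetic follows from the packed layout described earlier, and the alpha/beta pointers are per the cuBLAS v2 API (a sketch, not my exact code):

```c
// Current approach: one cublasSgemv launch per mini matrix, 128 launches total.
float scale1 = 1.0f, scale2 = 1.0f;   // beta = 1 so results accumulate into out
for (int i = 0; i < batch; ++i) {
    size_t offset1 = (size_t)i * Adim * Bdim;  // start of the i-th mini matrix
    size_t offset2 = (size_t)i * Adim;         // start of the i-th input vector
    size_t offset3 = (size_t)i * Bdim;         // start of the i-th output slot
    cublasSgemv(handle, CUBLAS_OP_T, Bdim, Adim, &scale1,
                d_matrix + offset1, Bdim,
                d_vector + offset2, 1,
                &scale2,
                out + offset3, 1);
}
```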
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go, or is there some way to make a single call with the relevant offsets (to index into my arrays of matrices and vectors)? The only other efficient option I could see was a cuSPARSE call with all the matrices placed on the block diagonal of one large sparse matrix.
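To make the streams option concrete, this is roughly what I imagine it would look like, reusing the loop above (cublasSetStream is the real cuBLAS call for this; the pool size and round-robin assignment are just assumptions, not something I have profiled):

```c
// Streams idea: issue each gemv on its own stream so launches can overlap.
const int nstreams = 16;                 // assumed pool size, not tuned
cudaStream_t streams[nstreams];
for (int s = 0; s < nstreams; ++s)
    cudaStreamCreate(&streams[s]);

for (int i = 0; i < batch; ++i) {
    cublasSetStream(handle, streams[i % nstreams]);  // round-robin over the pool
    cublasSgemv(handle, CUBLAS_OP_T, Bdim, Adim, &scale1,
                d_matrix + (size_t)i * Adim * Bdim, Bdim,
                d_vector + (size_t)i * Adim, 1,
                &scale2,
                out + (size_t)i * Bdim, 1);
}
cudaDeviceSynchronize();
for (int s = 0; s < nstreams; ++s)
    cudaStreamDestroy(streams[s]);
```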
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.