
I'm working with some large data using the cublas library for matrix multiplication. To save memory, I want something like A = A*B, where A and B are both n-by-n square matrices, i.e. using the same memory for the output and one of the input matrices.

While some old posts say this is not allowed in the cublas library, I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly, the calculation is completely correct and stable across repeated runs. So I'm wondering whether overlapped input and output is supported by the current cublas library. If so, how much memory does it actually save? Intuitively the function needs at least some extra memory to store intermediate results, since A_ij = Σ_k A_ik B_kj depends on a whole row of A. Is this particularly memory-saving for batched GEMMs?
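The row dependence can be sketched outside cuBLAS with a toy NumPy model (plain Python loops for illustration, not the actual library kernels): overwriting A element by element corrupts later products in the same row, while staging one original row at a time is already enough to make the in-place update correct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
expected = A @ B

# Naive "in-place" update: overwrite A element by element.
A_naive = A.copy()
for i in range(n):
    for j in range(n):
        # For j > 0, A_naive[i, k] with k < j already holds *output*
        # values, so this sum mixes old and new entries of row i.
        A_naive[i, j] = sum(A_naive[i, k] * B[k, j] for k in range(n))

# Safe in-place update: stage the original row before writing it.
A_safe = A.copy()
for i in range(n):
    row = A_safe[i, :].copy()  # one extra row of scratch storage
    for j in range(n):
        A_safe[i, j] = sum(row[k] * B[k, j] for k in range(n))

print(np.allclose(A_naive, expected))  # corrupted
print(np.allclose(A_safe, expected))   # correct
```

So in principle A = A*B needs only a row (or tile) of scratch space per concurrent worker, not a whole extra matrix; whether a given cuBLAS kernel actually stages its inputs that way is not documented.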

I think in the general case input/output overlap in CUBLAS is not provided for, with a few exceptions like geam. I suspect in the case of the various matrix-multiply ops, including gemmStridedBatched, you might be able to cook up a demonstration case where you could observe a failure, perhaps involving matrices that were individually large enough. - Robert Crovella
@Robert Crovella Thanks for the comments. I ran more tests using cublasDgemm(). On my GPU, the in-place operation is correct up to exactly a 512-by-512 matrix, and it becomes wrong for a 513-by-513 matrix. - Weixin Wang
The situations where this seems to work and seems not to work may depend on GPU type, CUDA version, and the specific CUBLAS function, and perhaps other variables. Your test case is enough to demonstrate that it is not provided for generally, and so I would be very hesitant to depend on it, even with specific restrictions. - Robert Crovella

1 Answer


While some old posts say this is not allowed in the cublas library,

And they are completely correct (noting that the "old posts" were referring to the standard GEMM calls, not the batched implementations you are asking about).

I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly, the calculation is completely correct and stable across repeated runs

This isn't documented as being safe, and I suspect you are only getting stable results by luck: small matrices are probably preloaded into shared memory or registers, so an in-place operation happens to work. With larger matrices I would expect failures, because eventually there would be a case where a single GEMM could not be performed without multiple trips to the source matrix after a write cycle, which would corrupt the source matrix.
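That failure mode can be mimicked with a toy NumPy model (an illustration of the hazard, not the real cuBLAS kernel; the tile size and staging behavior here are assumptions): compute the product tile by tile, writing each output tile back into A's storage. When the whole matrix is processed in one pass, the in-place result is correct; once the matrix is larger than a single tile, later tiles re-read columns of A that already hold output values.

```python
import numpy as np

def inplace_gemm_tiled(A, B, tile):
    """Toy model: write each output tile of A @ B back into A's storage."""
    n = A.shape[0]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Reads full rows A[i0:i0+tile, :]; for j0 > 0, some of those
            # columns were already overwritten by earlier output tiles.
            A[i0:i0 + tile, j0:j0 + tile] = (
                A[i0:i0 + tile, :] @ B[:, j0:j0 + tile]
            )

rng = np.random.default_rng(1)
n = 8
A0 = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
expected = A0 @ B

A_one_pass = A0.copy()
inplace_gemm_tiled(A_one_pass, B, tile=n)  # whole matrix in one pass: safe
A_tiled = A0.copy()
inplace_gemm_tiled(A_tiled, B, tile=4)     # multiple passes over A: corrupted

print(np.allclose(A_one_pass, expected))
print(np.allclose(A_tiled, expected))
```

This is consistent with the comment thread's observation that cublasDgemm() stayed correct up to 512-by-512 and failed at 513-by-513: a plausible (but unverified) reading is that below some size threshold the kernel's inputs are fully staged before any write occurs.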

I would not recommend in-place operations even if you find they work for one case. Different problem sizes, library versions, and hardware could produce failures you simply haven't tested. The choice, and the associated risk, is up to you.