I'm working with large matrices and using the cuBLAS library for matrix multiplication. To save memory, I want to compute something like A = A*B, where A and B are both n-by-n square matrices, i.e. use the same memory for the output and one of the input matrices.
While some old posts say this is not allowed in cuBLAS, I implemented it anyway using the cublasZgemmStridedBatched() function. Surprisingly, the results are correct and stay correct across repeated runs. So I'm wondering whether overlapping input and output is supported by the current cuBLAS library. If yes, how much memory does it actually save? Intuitively, the function needs at least some extra memory for intermediate results, since each output element A_ij = sum_k A_ik * B_kj depends on a whole row of A. Is this particularly memory-saving for batched gemms?
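To make the row dependency concrete, here is a minimal CPU-side sketch in plain C++. It is not cuBLAS code and says nothing about what cuBLAS does internally. The function names (`multiply`, `multiply_in_place_naive`, `multiply_in_place_row_buffer`) are hypothetical, and row-major storage with `double` elements is assumed for simplicity. It shows that writing each result straight back into A corrupts later elements of the same row, while a scratch buffer of a single row (n elements instead of n*n) is enough for a sequential in-place product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n-by-n matrix stored in a flat vector (assumption for this sketch).
using Matrix = std::vector<double>;

// Reference: out-of-place product C = A * B, needs a full n*n output buffer.
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Naive in-place attempt: A[i][j] is overwritten while the rest of row i
// still needs the old value, so later columns read corrupted inputs.
void multiply_in_place_naive(Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            a[i * n + j] = sum;  // hazard: row i is still being read
        }
}

// In-place with a one-row scratch buffer: the new row i is computed in full
// before it replaces the old row, so only n extra elements are needed.
void multiply_in_place_row_buffer(Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> row(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            row[j] = sum;
        }
        for (std::size_t j = 0; j < n; ++j)
            a[i * n + j] = row[j];  // safe: row i is no longer needed
    }
}
```

A parallel GPU kernel schedules its reads and writes very differently from this loop, so a correct result with aliased buffers in one run may only mean the hazard was not triggered for that size and configuration.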
geam. I suspect in the case of the various matrix-multiply ops, including gemmStridedBatched, you might be able to cook up a demonstration case where you could observe a failure, perhaps involving matrices that were individually large enough. - Robert Crovella