The data in the 3D matrix was generated layer by layer (from top to bottom), and I want to multiply it with a 2D matrix B. But instead of taking each layer whole, I need to take one vector from layer 1, one vector from layer 2, and so on.
Currently, I copy those vectors from the 3D matrix into a 2D matrix tmpA, multiply it with B (using CUBLAS), store the result in tmpB, and finally copy the result back, row by row, to the corresponding positions in a 3D matrix C.
Overall, my whole app runs at least twice as fast as the CPU version, but it seems to me that those memory copies, even though they are device-to-device, are hurting performance.
What would be a better way to do this computation? I was thinking about rearranging the data before multiplying, so as to avoid the memory copies altogether.
The 3D matrices A and C and the 2D matrix B are already in the GPU's memory.
EDIT
Let M, N, P be the dimensions of the 3D matrix A, stored in row-major order in a linear array in device memory, so that element (iM, iN, iP) lives at d_A[iM*N*P + iN*P + iP]. My code looks like this:
float *d_tmpIn, *d_tmpOut, *d_C;  // d_A, d_B, alpha and beta are set up elsewhere

cudaMalloc((void**)&d_tmpIn,  sizeof(float) * M * P);
cudaMalloc((void**)&d_tmpOut, sizeof(float) * M * P);
cudaMalloc((void**)&d_C,      sizeof(float) * M * N * P);

for (int iN = 0; iN < N; iN++)
{
    // Gather row iN of every layer of A into the M x P matrix d_tmpIn.
    float *dst = d_tmpIn;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(dst, &d_A[iM*N*P + iN*P], sizeof(float) * P,
                   cudaMemcpyDeviceToDevice);
        dst += P;
    }

    // The buffers hold floats, so single precision (Sgemm) is the right call.
    // cuBLAS is column-major: the M x P row-major d_tmpIn is seen as a
    // P x M matrix, and with the M x M matrix d_B this gives a P x M result.
    cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, P, M, M,
                &alpha, d_tmpIn, P, d_B, M, &beta, d_tmpOut, P);

    // Scatter the result back into the matching rows of C.
    float *src = d_tmpOut;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(&d_C[iM*N*P + iN*P], src, sizeof(float) * P,
                   cudaMemcpyDeviceToDevice);
        src += P;
    }
}
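Since, for a fixed iN, the gathered slice is just a strided submatrix of d_A (consecutive elements along P, with the rows N*P apart), it seems the copies could be skipped entirely by pointing cuBLAS straight into d_A and d_C through the leading-dimension argument. A minimal sketch of that idea, assuming d_B is M x M as in the call above:

for (int iN = 0; iN < N; iN++)
{
    // Column iM of this (column-major) P x M view starts at
    // d_A[iN*P + iM*N*P], so lda = N*P picks row iN out of every layer.
    cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, P, M, M,
                &alpha, d_A + iN*P, N*P, d_B, M,
                &beta,  d_C + iN*P, N*P);   // write straight into C's slice
}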
Hope this helps.
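EDIT 2
With CUDA 8.0 or newer, I believe the remaining loop of N small GEMMs could even be fused into a single cublasSgemmStridedBatched call, since consecutive slices of A (and of C) are a fixed stride P apart and the same B is reused for every slice. A sketch under those assumptions (strideB = 0 to broadcast B; I'm assuming the cuBLAS version at hand accepts a zero stride):

cublasSgemmStridedBatched(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                          P, M, M,
                          &alpha,
                          d_A, N*P, P,   // slice iN starts at d_A + iN*P
                          d_B, M,   0,   // the same B for every slice
                          &beta,
                          d_C, N*P, P,
                          N);             // one GEMM per iN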