
The data in the 3D matrix was generated layer by layer (from top to bottom), and I want to multiply that data with a 2D matrix B. But instead of taking each layer as a whole, I need to take a vector from layer 1, a vector from layer 2, and so on.

Currently, what I do is copy those vectors from the 3D matrix into a 2D matrix tmpA, multiply it with B (using CUBLAS), store the result in tmpB, and finally copy the result back, row by row, to where it belongs in a 3D matrix C.

Overall, my whole app runs at least twice as fast as the CPU version, but it seems to me that those memory copies, even though they are device-to-device, are not good for performance at all.

What would be a better way to do this computation? I was thinking about rearranging the data before multiplying, so as to avoid the memory copies.

The 3D matrices A and C and the 2D matrix B are already in GPU memory.

EDIT

Let M, N, P be the dimensions of the 3D matrix A, stored in row-major order in a linear array in device memory. My code looks like this:

double *d_tmpIn, *d_tmpOut, *d_C;
cudaMalloc((void**)&d_tmpIn,  sizeof(double)*M*P);
cudaMalloc((void**)&d_tmpOut, sizeof(double)*M*P);
cudaMalloc((void**)&d_C,      sizeof(double)*M*N*P);

for (int iN = 0; iN < N; iN++)
{
  // Gather: copy one P-long vector from each of the M layers into d_tmpIn
  double *dst = d_tmpIn;
  for (int iM = 0; iM < M; iM++)
  {
    cudaMemcpy(dst, &d_A[iM*N*P + iN*P], sizeof(double)*P, cudaMemcpyDeviceToDevice);
    dst += P;
  }

  cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
              P, M, M, &alpha, d_tmpIn, P, d_B, M, &beta, d_tmpOut, P);

  // Scatter: copy the result rows back to their places in C
  double *src = d_tmpOut;
  for (int iM = 0; iM < M; iM++)
  {
    cudaMemcpy(&d_C[iM*N*P + iN*P], src, sizeof(double)*P, cudaMemcpyDeviceToDevice);
    src += P;
  }
}
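
For reference, the indexing convention the copies above assume is that element (iM, iN, iP) of the M x N x P row-major array sits at linear offset iM*N*P + iN*P + iP. A small helper that makes this explicit (idxA is just for illustration, not part of my actual code):

// Illustrative helper: linear offset of element (iM, iN, iP) in an
// M x N x P row-major array, matching the d_A / d_C indexing above
inline size_t idxA(int iM, int iN, int iP, int N, int P)
{
  return (size_t)iM * N * P + (size_t)iN * P + (size_t)iP;
}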

Hope this helps.

Can you describe how the data is stored in GPU memory and what CUBLAS call you are using to do this calculation? It isn't all that clear what you are actually trying to do from the text (hint: equations and short code snippets are worth a thousand words). – talonmies
Normally a device-to-device cudaMemcpy should be quite quick. Have you profiled the app to determine where the time is being spent? – Robert Crovella
@RobertCrovella Indeed they are fast, but I was wondering if there is a better way to do it that avoids those memory copies altogether. I'll take a look at the answer given and see whether it helps. – BRabbit27

1 Answer


You don't need to do memory copies! The BLAS and LAPACK APIs were designed so that you can specify the starting point, the stride, the leading dimension, and so on.

This way you can use the 3D arrays A and C as they are, and simply call cublasDgemm with the correct parameters.

In your case (if I understand the code correctly), each matrix is P x M and you have N of them, but the 3D array is laid out as P x N x M. So, without allocating memory for d_tmpIn and d_tmpOut, you can do something like the following: the number of rows of A is P and the number of columns is M, but the leading dimension (lda) must be given as N * P. The same goes for C.

int lda = N * P;
int ldc = N * P;
for (int iN = 0; iN < N; iN++)
{
  // Point directly at slice iN inside A and C -- no copies needed
  double *d_tmpIn  = d_A + iN * P;
  double *d_tmpOut = d_C + iN * P;
  cublasSetStream(cublasHandle, streams[iN]); // Optional
  cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
              P, M, M, &alpha, d_tmpIn, lda, d_B, M, &beta, d_tmpOut, ldc);
}

You could also create N streams and run each cublas call in a separate stream. Note that this is only going to be useful if M and P are small enough that a single call does not already saturate the GPU computationally.

EDIT

If you do plan to go ahead with streams, create them once at the beginning of the program and re-use them. Do not create and destroy streams in the same loop as the Dgemm calls, as that increases overhead.
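
A minimal sketch of what that setup could look like (the streams array name is illustrative, and error checking is omitted):

// Create the streams once, up front
cudaStream_t *streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for (int i = 0; i < N; i++)
  cudaStreamCreate(&streams[i]);

// ... run the Dgemm loop shown above, calling
// cublasSetStream(cublasHandle, streams[iN]) before each Dgemm ...

// Destroy them once, at the end of the program
for (int i = 0; i < N; i++)
  cudaStreamDestroy(streams[i]);
free(streams);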