
I have a problem writing a matrix multiplication using CUDA. I have to compute A*A*A*A and save the result in hB. With cuBLAS it works, but I can't get it right with plain CUDA. The dimension can be a high value like 2000. This is my code:

__global__ void CudaMM(float *A, float *B, int N)
{

    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];

    B[row*N+col] = sum;
}

void CudaMult(int dimension, float *hMatrice, float *hB, float *d_A, float *d_B){
    int N, K;
    K = 100;
    N = K*BLOCK_SIZE;

    dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 grid(K,K);

    cudaMemcpy(d_A,hMatrice,dimension*dimension*sizeof(float),cudaMemcpyHostToDevice);

    CudaMM<<<grid,threadBlock>>>(d_A,d_B,N);

    cudaMemcpy(hB,d_B,dimension*dimension*sizeof(float),cudaMemcpyDeviceToHost);
}

void CublasFindConnect(int dimension, float* mat, float* B){

    float *d_A, *d_B;
    cudaMalloc(&d_A,dimension*dimension*sizeof(float));
    cudaMalloc(&d_B,dimension*dimension*sizeof(float));

    int w = 0;
    while (w < 5) {

        CudaMult(dimension,mat,B,d_A,d_B);

        // Copy computed matrix B back to mat, and clear B for the next pass
        for (int m = 0; m < dimension; m++) {
            for (int n = 0; n < dimension; n++) {
                mat[m*dimension+n] = B[m*dimension+n];
                B[m*dimension+n] = 0;
            }
        }

        w++;
    }

    cudaFree(d_A);
    cudaFree(d_B);
}

I've installed the latest CUDA 6, which doesn't require cudaMemcpy because memory is shared.


1 Answer

  • I would suggest you start by doing proper CUDA error checking on the code you have shown and seeing what results you get (a sketch of one way to do this follows this list).
  • It would also help if you showed a complete code. For example, what is BLOCK_SIZE? The point is not to tell me what BLOCK_SIZE is, but to show a complete code.
  • As an aside, the CUDA 6 feature you are referring to has specific requirements (such as the use of cudaMallocManaged()) that your code doesn't meet; but your code doesn't depend on Unified Memory anyway, so it's irrelevant here.
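
As a minimal sketch of that error checking (the cudaCheck macro name is my own placeholder, not part of the CUDA API), you can wrap every runtime call and query launch errors after each kernel:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical helper: print the failing location and abort on any CUDA error.
    #define cudaCheck(call)                                              \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                        cudaGetErrorString(err), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    // Usage:
    //   cudaCheck(cudaMemcpy(d_A, hMatrice, bytes, cudaMemcpyHostToDevice));
    //   CudaMM<<<grid, threadBlock>>>(d_A, d_B, N);
    //   cudaCheck(cudaGetLastError());        // catches launch-configuration errors
    //   cudaCheck(cudaDeviceSynchronize());   // catches errors during kernel execution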

One problem I can see in your code is that your dimension variable is arbitrary (you say it can be up to a large number like 2000), but your computation size is fixed at N=K*BLOCK_SIZE;. Presumably, if your BLOCK_SIZE is some value like 16 or 32, it will meet your approximate maximum dimension size of ~2000.

The problem arises because your grid size is potentially larger than your valid array size. You are launching a grid of NxN threads, but N can be larger than dimension. This means some of the launched threads can attempt to access the matrices (A and B) outside of their valid dimensions.

You can fix this with a "thread check" in your kernel, something like this:

__global__ void CudaMM(float *A, float *B, int N)
{

    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((row < N) && (col < N)) {   // thread check: skip threads outside the valid matrix

      float sum = 0.f;
      for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];

      B[row*N+col] = sum;
    }
}

and you will need to modify your kernel invocation to:

CudaMM<<<grid,threadBlock>>>(d_A,d_B,dimension);

You might also want to consider choosing your grid size based on your actual dimension, rather than fixing it at 100*BLOCK_SIZE, but that is not essential to get the code to work.
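
If you do want that, a minimal sketch (assuming the same BLOCK_SIZE constant used above) rounds the grid up so it just covers dimension:

    // Launch just enough blocks to cover 'dimension', instead of hard-coding K = 100.
    int K = (dimension + BLOCK_SIZE - 1) / BLOCK_SIZE;   // ceiling division

    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(K, K);

    CudaMM<<<grid, threadBlock>>>(d_A, d_B, dimension);

Combined with the thread check above, this handles any dimension without launching a large number of unnecessary threads.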