
EDITED to correspond with the current state after the linked question.

I am currently trying to reimplement basic matrix multiplication in CUDA. While my code works fine for square matrices, and for rectangular matrices whose dimensions are multiples of 8, it does not appear to work for rectangular matrices whose dimensions are not multiples of 8.

The following is my multiplication kernel:

    __global__ void matrixMultiply(float *A, float *B, float *C,
                                   int numARows, int numAColumns,
                                   int numBRows, int numBColumns,
                                   int numCRows, int numCColumns) {
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        if (numAColumns != numBRows) return;
        if ((Row < numARows) && (Col < numBColumns)) {
            float Cvalue = 0;
            for (int k = 0; k < numAColumns; ++k)
                Cvalue += A[Row*numAColumns + k] * B[k*numBColumns + Col];
            C[Row*numCColumns + Col] = Cvalue;
        }
    }

The following is the memory allocation (for readability I have cut out the error checking):

cudaMalloc((void**) &deviceA, ARows*sizeof(float)*AColumns);
cudaMalloc((void**) &deviceB, BRows*sizeof(float)*BColumns);
cudaMalloc((void**) &deviceC, CRows*sizeof(float)*CColumns);
cudaMemcpy(deviceA, hostA, ARows*sizeof(float)*AColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, BRows*sizeof(float)*BColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceC, hostC, CRows*sizeof(float)*CColumns, cudaMemcpyHostToDevice);

The following is the kernel call:

dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);
dim3 dimBlock(8 , 8, 1);
matrixMultiply<<<dimGrid,dimBlock>>>(deviceA, deviceB, deviceC, numARows, numAColumns, numBRows, numBColumns, numCRows, numCColumns);

And finally, moving the memory back:

cudaMemcpy(hostC, deviceC, CRows*sizeof(float)*CColumns, cudaMemcpyDeviceToHost);

Now I have traced through my algorithm repeatedly, and I do not believe there is anything wrong with it, so I personally suspect something is wrong with the block/grid sizing scheme I've used. If anybody who knows CUDA/C better than I do (Ruby/JavaScript guy here) could take a look at it and walk me through what exactly I am doing wrong, I would be very grateful.

This question may be of interest. - Robert Crovella

Why do we need all the row and column variables? Don't numARows, ARows, and CRows all have to be the same number? Likewise for BColumns and CColumns. Anyway, since things work for dimensions divisible by 8, my guess is that when the dimensions are not divisible by 8 (forcing extra blocks on the boundary), some of your threads are not correctly shut off. So I would focus on this line: if ((Row < numARows) && (Col < numBColumns)){ What happens if you change it to if ((Row < numCRows) && (Col < numCColumns)){? Or, I'd like to see your numerical values for every row and column variable. - Robert Crovella

All those input parameters are part of the template code given by the instructor. - Barry Brown

You didn't copy matrix B to the device? Or is it a typo in the question? - pQB

Good spot, unfortunately it was just a typo. Revising the question. - Abraham P

1 Answer


The problem is with the grid size you are creating:

dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);

Rows correspond to the Y dimension of the grid and columns to the X dimension, so you are actually creating the transposed grid.

To create the correct grid, do the following:

dim3 dimGrid((int)ceil(numCColumns / 8.0) , (int)ceil(numCRows / 8.0), 1);

A better approach is to do the following:

dim3 dimGrid;
dimGrid.x = (numCColumns + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (numCRows + dimBlock.y - 1) / dimBlock.y;