Cuda Matrix Example Block Size

Question

I just started learning CUDA and I have been looking at examples on NVIDIA's website. Specifically, I have implemented the non-shared version of the matrix multiply (the first sample is the non-shared version even though it is in the shared memory section):

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory

I am having a problem with the output when I change the block sizes. NVIDIA's code has a default block size of 16 and this gives me the correct output when I multiply two matrices. However, if I change the block size to anything above 16 (while still being a multiple of 16), I get an output of zero for all elements in the matrix. I tested this on my laptop too and noticed the same results for anything over 32 rather than 16. Could someone explain what is happening? I have two 9800GTX+ video cards in SLI and so I should have a maximum block size of (512,512,1). Why can I only do 16?

Also, I am noticing the same behavior in the shared version of the matrix multiplication (also on the NVIDIA page).

I didn't post my code because I get the same problem if I directly copy the code from the NVIDIA site.

I would really appreciate any help with this or with resources to learn more about these kinds of CUDA details.

Thank you!

I have attached the code as requested:

    #include "stdio.h"
    #include <cuda.h>
    #include <assert.h>
    #include <time.h>
    #include <math.h>

    // This is an example CUDA program that compares the timings of a matrix multiplication.
    // The comparisons are between the CPU, GPU, and the GPU with shared memory.

    #define BLOCK_SIZE 32

    typedef struct {

    int width;
    int height;
    int stride;
    float* elements;

    } Matrix;

    typedef void (*FuncPtr)(Matrix& A, Matrix& B, Matrix& C);

    void multiplyMatrix(Matrix& A, Matrix& B, Matrix& C);

    // Helper declarations
    void initializeMatrix(Matrix& A, int rows, int cols, float val);
    void copyMatrix(Matrix& dest, Matrix& src);
    void freeMatrix(Matrix& A);
    void printError(cudaError_t err);
    void printMat(Matrix& A);
    void setVal(Matrix& A, float val);
    double applyMultFunc(FuncPtr func, Matrix& A, Matrix& B, Matrix& C, int numOfIters);

    // CUDA declarations
    __global__ void cudaMultMat(Matrix A, Matrix B, Matrix C);


   int main() {

       printf("Beginning Matrix Multiplication Comparison\n");

       // Initialize matrix
       Matrix A, B, C;
       int rowsA = 32;
       int colsA = 32;
       int colsB = 32;
       initializeMatrix(A, rowsA, colsA, 5.0f);
       initializeMatrix(B, colsA, colsB, 2.0f);
       initializeMatrix(C, rowsA, colsB, 0.0f);

       // C = A * B using CPU, GPU, and GPU with shared memory
       FuncPtr gpuMatMult = &multiplyMatrix;
       int numOfIterations = 100;
       double multTime = applyMultFunc(gpuMatMult, A, B, C, numOfIterations);

       printMat(C); 

       // Update user
       printf("Normal Mat Mult Time: %f\n", multTime);


       // Cleanup
       freeMatrix(A);
       freeMatrix(B);
       freeMatrix(C);

       printf("\nPress Enter to continue...\n");
       getchar();

       return 0;

  }

  void multiplyMatrix(Matrix& A, Matrix& B, Matrix& C) {

    // Initialize device matrices
    Matrix deviceA, deviceB, deviceC;
    copyMatrix(deviceA, A);
    copyMatrix(deviceB, B);
    copyMatrix(deviceC, C);

    // Initialize number of blocks and threads
    dim3 numOfThreadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    int xSize = (C.width + numOfThreadsPerBlock.x - 1) / numOfThreadsPerBlock.x;
    int ySize = (C.height + numOfThreadsPerBlock.y - 1) / numOfThreadsPerBlock.y;
    dim3 numOfBlocks(xSize, ySize);

    // Call CUDA kernel
    cudaMultMat<<<numOfBlocks, numOfThreadsPerBlock>>>(deviceA, deviceB, deviceC);
    printError(cudaThreadSynchronize());
    printError(cudaMemcpy(C.elements, deviceC.elements, C.height * C.width * sizeof(float), cudaMemcpyDeviceToHost));

    // Free cuda memory
    printError(cudaFree(deviceA.elements));
    printError(cudaFree(deviceB.elements));
    printError(cudaFree(deviceC.elements));

  }



 // CUDA definitions

 // GPU matrix multiplication (non-shared memory)
 __global__ void cudaMultMat(Matrix A, Matrix B, Matrix C) {

    // If the matrices are of the wrong size then return
    if(A.width != B.height) {
        return;
    }

    // Initialize the indexes into the grid
    int col = (blockDim.x * blockIdx.x) + threadIdx.x;
    int row = (blockDim.y * blockIdx.y) + threadIdx.y;

    // Initialize the result
    float cVal = 0.0f;

    // Find the result for the dot product of a row of A and a column of B
    for(int i = 0; i < A.width; i++) {

        cVal += A.elements[row * A.width + i] * B.elements[i * B.width + col];

     }

     // If we are in bounds then save the result
     if(row < C.height && col < C.width) {
        C.elements[row * C.width + col] = cVal;
     }

  } 

  // Helper functions
  void initializeMatrix(Matrix& A, int rows, int cols, float val) {

    A.width = cols;
    A.height = rows;
    A.stride = A.width;
    int numOfElements = A.width * A.height;
    A.elements = (float*) malloc(numOfElements * sizeof(float));
    for(int i = 0; i < numOfElements; i++) {
        A.elements[i] = val;
    }

   }

   void copyMatrix(Matrix& dest, Matrix& src) {

    dest.width = src.width;
    dest.height = src.height;
    dest.stride = src.stride;
    int size = src.width * src.height * sizeof(float);
    printError(cudaMalloc(&dest.elements, size)); 
    printError(cudaMemcpy(dest.elements, src.elements, size, cudaMemcpyHostToDevice));

   }

   void freeMatrix(Matrix& A) {
    free(A.elements);
   }

   void printError(cudaError_t err) {
    if(err != 0) {
        printf("CUDA ERROR: %s\n", cudaGetErrorString(err));
        getchar();
    }

    }

    void printMat(Matrix& A) {

    printf("*********************************\n");
    for(int i = 0; i < A.height; i++) {
         for(int j = 0; j < A.width; j++) {
             int index = i * A.width + j;
             printf("%2.1f, ", A.elements[index]); 
         }
         printf("\n");
     }

  }

  void setVal(Matrix& A, float val) {

     for(int i = 0; i < A.width * A.height; i++) {
          A.elements[i] = val;
     }

  }

  double applyMultFunc(FuncPtr func, Matrix& A, Matrix& B, Matrix& C, int numOfIters) {

    clock_t startTime = clock();
    for(int i = 0; i < numOfIters; i++) {
        func(A, B, C);
     } 
     clock_t endTime = clock();
     return (double) (endTime - startTime) / CLOCKS_PER_SEC;

   }

please post your code, its important to see the size of matrices you are using or grid size, and a lot of other things due to which this can happen — Bharat

Robert Crovella Robert Crovella · Accepted Answer · 2013-11-20T07:17:09

You're exceeding the threads per block specification of your GPU when you increase the block sizes.

The 9800GTX has a limit of 512 threads per block, regardless of how you create the block. 16*16 = 256 which is OK. 32 x 32 = 1024 which is not OK. In this case the kernel fails to run and so the output is not correct.

Your laptop probably has a newer GPU which supports 1024 threads per block, so 32 x 32 is OK but anything larger is not.

If you add proper cuda error checking to the code you can confirm this. Note that this code appears to have cuda error checking, but the checking implemented on the kernel call is incoomplete. Study the link I gave and you will see the difference. If you modify the code with complete error checking, you will see the error.

Cuda Matrix Example Block Size

2 Answers