Cuda program for matrix batch multiplication

Question

I am a novice in the field of CUDA program and I am trying to repeat the function of cublasSgemmBatched, which means that I want to perform the matrix-matrix multiplication of a batch of matrices. I try to implement my idea as the following code.

#include <stdio.h>

__global__ void BatchMulCUDA(float* array1, float* array2, int narray1, int dim, float* result)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;

    if (tx < narray1 * dim)
    {
        float temp = 0;
        int index = tx / dim;
#pragma

        for (int i = 0; i < dim; i++)
        {
            temp += array1[tx * dim + i] * array2[index * dim + i];
        }

        result[tx] = temp;
    }
} 

void BatchMulGPU(float* array1, float* array2, int narray1, int dim, float* result)
{
    dim3 threads(1024, 1);
    dim3 grid(narray1 / 1024 + 1, 1);
    int threadsPerBlock = threads.x * threads.y;
    int blocksPerGrid = grid.x * grid.y;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    BatchMulCUDA<<<grid, threads>>>(array1, array2, narray1, dim, result);
}

However, strangely, I found that I can get the right output before the index 19730. After the element of 19730, the output of GPU is always 0. I do not know what the problem is. The CPU version of my code and test function are as the following. Is there any hardware limitation that I do not realize?

#include "kernel.h"

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <math.h>

double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}

void BatchMulCPU(float* array1, float* array2, int narray1, int dim, float* result)
{
    for (int i = 0; i < narray1 * dim; i++)
    {
        float temp = 0;
        int index = i / dim;
        for (int j = 0; j < dim; j++)
        {
            temp += array1[i * dim + j] * array2[index * dim + j];
        }
        result[i] = temp;
    }
}

int main(int argc, char** argv)
{
    int narray1 = 6980;
    int dim = 4;

    float* array1 = new float[narray1 * dim * dim];
    float* array2 = new float[narray1 * dim];
    float* resultGPU = new float[narray1 * dim];
    float* resultCPU = new float[narray1 * dim];

    float* d_array1;
    float* d_array2;
    float* d_result;

    for (int i = 0; i < narray1 * dim * dim; i++)
    {
        array1[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
    }

    for (int i = 0; i < narray1 * dim; i++)
    {
        array2[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
    }

    cudaError_t err;

    double iStart = cpuSecond();
    err = cudaMalloc((void**)&d_array1, narray1 * dim * dim * sizeof(float));
    err = cudaMalloc((void**)&d_array2, narray1 * dim * sizeof(float));
    err = cudaMalloc((void**)&d_result, narray1 * dim * sizeof(float));

    err = cudaMemcpy(d_array1, array1, narray1 * dim * dim * sizeof(float), cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_array2, array2, narray1 * dim * sizeof(float), cudaMemcpyHostToDevice);

    BatchMulGPU(d_array1, d_array2, narray1, dim, d_result);

    err = cudaMemcpy(resultGPU, d_result, narray1 * dim * sizeof(float), cudaMemcpyDeviceToHost);

    double iElaps = cpuSecond() - iStart;

    printf("Total GPU computation time is %lf \n" , iElaps);

    iStart = cpuSecond();
    BatchMulCPU(array1, array2, narray1, dim, resultCPU);
    iElaps = cpuSecond() - iStart;

    printf("Total CPU computation time is %lf \n" , iElaps);

    float error = 0;
    float temp = 0;
    for (long i = 0; i < narray1 * dim; i++)
    {
        // temp = abs(resultCPU[i] - resultGPU[i]);
        // if (temp > 0.5)
        // {
        //  std::cout << i << std::endl;
        // }
        error += abs(resultCPU[i] - resultGPU[i]);

    }

    printf("Error is %f \n", error);

    // for (int i = 19730; i < 19750; i++)
    // {
    //  std::cout << "GPU " << resultGPU[i] << std::endl;
    //  std::cout << "CPU " << resultCPU[i] << std::endl;
    // }

    cudaFree(d_array1);
    cudaFree(d_array2);
    cudaFree(d_result);

    return 0;
}

I would guess this is a watchdog timer problem. If you perform a bit of simple error checking, I will guess you see a runtime error generated at large sizes — talonmies
@talonmies Actually, if I set the narray equals to 100, I will get the right result. — Sean
You may want to look at TDR and WDDM as described here: stackoverflow.com/a/37040352/6218300 — Florent DUGUET

Robert Crovella Robert Crovella · Accepted Answer · 2019-03-27T01:28:43

Apart from the possibility of a WDDM TDR timeout as discussed in the comments, the code has an error.

Its evident that the kernel design expects that a total grid size (total number of threads) will be launched that is equal to or greater than the number of arrays times the side dimension:

int tx = blockIdx.x * blockDim.x + threadIdx.x;

if (tx < narray1 * dim)

i.e. narray1*dim are the needed number of threads

However the number being launched is only narray1:

dim3 threads(1024, 1);
dim3 grid(narray1 / 1024 + 1, 1);

If we change the last line above to:

dim3 grid((narray1*dim) / 1024 + 1, 1);

this code design error will be addressed.

The reason the code works correctly for small number of matrices (anything up to 256) is because of the rounding-up effect in the grid sizing to a minimum of 1024 threads, which is 256*4 (narray1 * dim).

As an aside, this code is not functionally similar to cublasSgemmBatched from what I can see. I don't recognize this code as being any matrix multiplication (matrix dot product) that I am familiar with.

Cuda program for matrix batch multiplication

1 Answers