I want to perform an OLS fit for a very large number of small matrices by running the matrix operations in parallel on a GPU. I have written code which seems to function correctly, but it is slower than anticipated: it currently runs slower than a single-threaded CPU version, despite the parallel computation on the GPU. The NVIDIA Visual Profiler indicates that memory allocation is taking up a lot of time, and I suspect the culprit is the dynamic allocation of different-sized matrices inside the kernel. I need advice and help with speeding up the kernel runtime.
I have tried using new and delete for each matrix created in the loop.
Here is the kernel:
__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M, const unsigned int N,
              int *sub_col, int *sub_size, int *cumulative_size,
              const unsigned int numberOfCalculations) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < numberOfCalculations; i += stride) {
        int size = sub_size[i];
        int start_index = cumulative_size[i];
        // Design matrix: a leading column of ones, then the selected columns of X
        double *sub_matrix = new double[M * (1 + size)];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < M; k++) {
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index + j] + 1)];
            }
        }
        R2[i] = getR2(y, sub_matrix, M, size + 1);
        delete[] sub_matrix;
    }
}
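One approach I have been considering is to stop allocating inside the kernel entirely: allocate a single workspace with cudaMalloc before launch and let each resident thread reuse a fixed slice sized for the worst case. A sketch of what I mean (the workspace and max_size parameters are my additions; max_size would be the largest value in sub_size):

```cuda
__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M, const unsigned int N,
              int *sub_col, int *sub_size, int *cumulative_size,
              const unsigned int numberOfCalculations,
              double *workspace, const unsigned int max_size) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Each thread owns a fixed slice big enough for the largest sub-matrix,
    // so no in-kernel new/delete is needed; the slice is reused every iteration.
    double *sub_matrix = workspace + (size_t)index * M * (1 + max_size);
    for (int i = index; i < numberOfCalculations; i += stride) {
        int size = sub_size[i];
        int start_index = cumulative_size[i];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < M; k++) {
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index + j] + 1)];
            }
        }
        R2[i] = getR2(y, sub_matrix, M, size + 1);
    }
}
```

The host would cudaMalloc numBlocks * blockSize * M * (1 + max_size) doubles once before the launch. (If in-kernel new is kept instead, the device heap can be enlarged with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) before launch, but avoiding the per-iteration allocation entirely seems more promising.)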
In the device function getR2, we have the following:
__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N) {
    // Initialize values
    double R2, numerator;
    double *A  = new double[N * N];
    double *IA = new double[N * N];
    double *yX = new double[N];
    // Generate all components
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    // Calc R2
    numerator = olsR2numerator(yX, IA, N);
    R2 = numerator / yTy(y, M);
    delete[] A;
    delete[] IA;
    delete[] yX;
    return R2;
}
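The same idea would apply here: the three allocations in getR2 could be replaced by a caller-provided scratch buffer of 2*N*N + N doubles, carved from the same pre-allocated workspace. A sketch (the scratch parameter is my addition):

```cuda
__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N,
             double *scratch) {
    // Carve the three temporaries out of one caller-owned buffer
    double *A  = scratch;              // N*N doubles
    double *IA = scratch + N * N;      // N*N doubles
    double *yX = scratch + 2 * N * N;  // N doubles
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    double numerator = olsR2numerator(yX, IA, N);
    return numerator / yTy(y, M);
}
```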
The actual kernel call is like this:
comb_ols<<<numBlocks, blockSize>>>(Y, X, R2, M, N, sub_columns, sub_size, cumulative_size, numberOfCalculations);
Currently, the kernel run time is roughly 1.4 seconds, whereas the single-threaded CPU version takes 0.7 seconds. I expected the kernel to be much faster, since it is just looping over many independent matrix operations, which should be appropriate for a GPU. There seems to be something inefficient in how the memory for the varying-sized matrices is allocated. What do you guys say about storing various-sized matrices dynamically inside the kernel? How should this be done in the most efficient way?
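For reference, here is how I would size a single pre-allocated workspace on the host if I go the no-in-kernel-allocation route: every resident thread gets a slice large enough for the worst-case sub-matrix plus the getR2 temporaries. The helper name workspace_doubles is hypothetical, just to show the arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of doubles one device workspace needs if every
// resident thread gets a slice sized for the worst-case sub-matrix plus the
// getR2 temporaries (2*n*n + n doubles, where n = 1 + max_size).
std::size_t workspace_doubles(const std::vector<int> &sub_size,
                              unsigned int M,
                              unsigned int num_threads) {
    int max_size = *std::max_element(sub_size.begin(), sub_size.end());
    std::size_t n = static_cast<std::size_t>(max_size) + 1;  // columns incl. intercept
    std::size_t per_thread = M * n       // sub_matrix
                           + 2 * n * n   // A and IA in getR2
                           + n;          // yX in getR2
    return per_thread * num_threads;
}
```

The result times sizeof(double) would be passed to a single cudaMalloc before the kernel launch, with num_threads = numBlocks * blockSize.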
Any other feedback on given code is appreciated.