I want to perform an OLS fit for a very large number of small matrices by running the matrix operations in parallel on a GPU. I have written code which seems to function correctly, but it is slower than anticipated: it currently runs slower than a single-threaded CPU version, despite the parallel computation on the GPU. The NVIDIA Visual Profiler indicates that memory allocation is taking up a lot of time, and I suspect the culprit is the dynamic allocation of different-sized matrices inside the kernel. I need advice and help with speeding up the kernel runtime.
I have tried using new and delete for each matrix created in the loop.
Here is the kernel:
__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M, const unsigned int N,
              int *sub_col, int *sub_size, int *cumulative_size,
              const unsigned int numberOfCalculations) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < numberOfCalculations; i += stride) {
        int size = sub_size[i];
        int start_index = cumulative_size[i];
        // Design matrix: a leading column of ones, then the selected columns of X
        double *sub_matrix = new double[M * (1 + size)];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < M; k++) {
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index + j] + 1)];
            }
        }
        R2[i] = getR2(y, sub_matrix, M, size + 1);
        delete[] sub_matrix;
    }
}
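One approach I have been considering is to stop allocating inside the kernel entirely: allocate a single workspace with cudaMalloc before launch and let each resident thread reuse a fixed slice sized for the worst case. A sketch of what I mean (the workspace and max_size parameters are my additions; max_size would be the largest value in sub_size):

```cuda
__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M, const unsigned int N,
              int *sub_col, int *sub_size, int *cumulative_size,
              const unsigned int numberOfCalculations,
              double *workspace, const unsigned int max_size) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Each thread owns a fixed slice big enough for the largest sub-matrix,
    // so no in-kernel new/delete is needed; the slice is reused every iteration.
    double *sub_matrix = workspace + (size_t)index * M * (1 + max_size);
    for (int i = index; i < numberOfCalculations; i += stride) {
        int size = sub_size[i];
        int start_index = cumulative_size[i];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < M; k++) {
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index + j] + 1)];
            }
        }
        R2[i] = getR2(y, sub_matrix, M, size + 1);
    }
}
```

The host would cudaMalloc numBlocks * blockSize * M * (1 + max_size) doubles once before the launch. (If in-kernel new is kept instead, the device heap can be enlarged with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) before launch, but avoiding the per-iteration allocation entirely seems more promising.)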
In the device function getR2, we have the following:
__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N) {
    // Initialize values
    double R2, numerator;
    double *A  = new double[N * N];
    double *IA = new double[N * N];
    double *yX = new double[N];
    // Generate all components
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    // Calc R2
    numerator = olsR2numerator(yX, IA, N);
    R2 = numerator / yTy(y, M);
    delete[] A;
    delete[] IA;
    delete[] yX;
    return R2;
}
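The same idea would apply here: the three allocations in getR2 could be replaced by a caller-provided scratch buffer of 2*N*N + N doubles, carved from the same pre-allocated workspace. A sketch (the scratch parameter is my addition):

```cuda
__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N,
             double *scratch) {
    // Carve the three temporaries out of one caller-owned buffer
    double *A  = scratch;              // N*N doubles
    double *IA = scratch + N * N;      // N*N doubles
    double *yX = scratch + 2 * N * N;  // N doubles
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    double numerator = olsR2numerator(yX, IA, N);
    return numerator / yTy(y, M);
}
```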
The actual kernel call is like this:
comb_ols<<<numBlocks, blockSize>>>(Y, X, R2, M, N, sub_columns, sub_size, cumulative_size, numberOfCalculations);
Currently, the kernel run time is roughly 1.4 seconds, whereas the single-threaded CPU version takes 0.7 seconds. I expected the kernel to be much faster, since it is just looping over many independent matrix operations, which should be appropriate for a GPU. There seems to be something inefficient in how the memory for the varying-sized matrices is allocated. What do you guys say about storing various-sized matrices dynamically inside the kernel? How should this be done in the most efficient way?
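For reference, here is how I would size a single pre-allocated workspace on the host if I go the no-in-kernel-allocation route: every resident thread gets a slice large enough for the worst-case sub-matrix plus the getR2 temporaries. The helper name workspace_doubles is hypothetical, just to show the arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of doubles one device workspace needs if every
// resident thread gets a slice sized for the worst-case sub-matrix plus the
// getR2 temporaries (2*n*n + n doubles, where n = 1 + max_size).
std::size_t workspace_doubles(const std::vector<int> &sub_size,
                              unsigned int M,
                              unsigned int num_threads) {
    int max_size = *std::max_element(sub_size.begin(), sub_size.end());
    std::size_t n = static_cast<std::size_t>(max_size) + 1;  // columns incl. intercept
    std::size_t per_thread = M * n       // sub_matrix
                           + 2 * n * n   // A and IA in getR2
                           + n;          // yX in getR2
    return per_thread * num_threads;
}
```

The result times sizeof(double) would be passed to a single cudaMalloc before the kernel launch, with num_threads = numBlocks * blockSize.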
Any other feedback on given code is appreciated.