Currently I'm trying to implement a simple Linear Regression algorithm in matrix form based on cuBLAS with CUDA. Matrix multiplication and transposition work well with the cublasSgemm function.
The problems begin with matrix inversion, based on the cublas<t>getrfBatched() and cublas<t>getriBatched() functions (see here).
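For context, the declaration of the single-precision factorization routine, as given in the cuBLAS documentation (comments mine), is:

cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   float *const Aarray[],  // device array of device pointers
                                   int lda,
                                   int *PivotArray,        // device, n * batchSize ints
                                   int *infoArray,         // device, batchSize ints
                                   int batchSize);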
As can be seen, the input parameters of these functions are arrays of pointers to matrices. Imagine that I've already allocated memory on the GPU for the (A^T * A) matrix as a result of previous calculations:
// device buffer holding the n x n product A^T * A
float* dProdATA;
cudaError_t cudaStat = cudaMalloc((void **)&dProdATA, n*n*sizeof(*dProdATA));
Is it possible to run the factorization (inversion)
// note: &dProdATA here is a HOST address of the device pointer dProdATA
cublasSgetrfBatched(handle, n, &dProdATA, lda, P, INFO, mybatch);
without additional HOST <-> GPU memory copying (see a working example of inverting an array of matrices) and without allocating a single-element pointer array on the device, i.e., just by obtaining a GPU-side reference to the GPU pointer?
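For completeness, this is the workaround I'm currently using and would like to eliminate (a minimal sketch continuing the variables above; dArrayA and the batch size of 1 are my own choices):

// one-element device array of device pointers, filled with an
// extra HOST -> GPU copy -- exactly the overhead I want to avoid
float** dArrayA;
cudaMalloc((void **)&dArrayA, sizeof(float*));
cudaMemcpy(dArrayA, &dProdATA, sizeof(float*), cudaMemcpyHostToDevice);

int *P, *INFO;
cudaMalloc((void **)&P, n * sizeof(int));   // pivot indices, n per matrix
cudaMalloc((void **)&INFO, sizeof(int));    // one status value per matrix

cublasSgetrfBatched(handle, n, dArrayA, lda, P, INFO, 1);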