I wrote a conjugate-gradient solver (for linear systems of equations) with LU preconditioning, using Dr. Maxim Naumov's papers on NVIDIA's research community as a guideline. The preconditioner application step, which requires solving a lower triangular system and then an upper triangular system, is divided into two phases:
- an analysis phase, which exploits the sparsity pattern and decides the level of parallelization;
- the solve phase itself.
According to all the posts related to this topic, and to Naumov's paper itself, the analysis phase is significantly slower than the solve phase, but it's executed only once, so it shouldn't be an issue when considering the whole execution time. In my program, however, the analysis phase takes ~35-45% of the total solution time (the time needed to perform all iterations!), which is quite annoying.

Another thing: I'm fairly sure the matrix's sparsity pattern doesn't allow much parallelization, since nearly every row depends on previous rows being solved first (in CFD each node needs at least the 6 adjacent nodes around it (hexahedral volumes), and that pattern repeats on every row), so the analysis phase won't be very useful anyway!
matrixLU in this code contains both triangular factors: the upper triangular factor keeps the original matrix diagonal, while the lower triangular factor has a unit diagonal (as in LU factorization).
// z = inv(matrixLU)*r : both triangular factors live in the combined LU storage
cusparseMatDescr_t descrL = 0;
cusparseMatDescr_t descrU = 0;
cusparseStatus = cusparseCreateMatDescr(&descrL);
cusparseStatus = cusparseCreateMatDescr(&descrU);

// L: lower fill mode, unit diagonal (the stored diagonal belongs to U)
cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_UNIT);
cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);

// U: upper fill mode, non-unit diagonal
cusparseSetMatType(descrU, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrU, CUSPARSE_DIAG_TYPE_NON_UNIT);
cusparseSetMatFillMode(descrU, CUSPARSE_FILL_MODE_UPPER);

cusparseSolveAnalysisInfo_t inforL = 0;
cusparseSolveAnalysisInfo_t inforU = 0;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforL);
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforU);

// one-time analysis of the L and U sparsity patterns (this is the slow part)
double startFA = omp_get_wtime();
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrL, matrixLU, iRow, jCol, inforL);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("cusparseDcsrsv_analysis (L) error!\n\n");
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrU, matrixLU, iRow, jCol, inforU);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("cusparseDcsrsv_analysis (U) error!\n\n");
double finishFA = omp_get_wtime();
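For reference, the per-iteration solve phase that reuses these analysis objects looks roughly like this (a sketch, not my exact code; `temp` is a hypothetical device vector of length N holding the intermediate result, and `one` is 1.0 on the host):

```c
/* Preconditioner application each CG iteration: z = inv(U) * inv(L) * r,
   reusing inforL / inforU from the one-time analysis above.
   'temp' is a hypothetical device buffer of length N. */
const double one = 1.0;

/* forward substitution: L * temp = r */
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                      N, &one, descrL, matrixLU, iRow, jCol, inforL, r, temp);

/* backward substitution: U * z = temp */
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                      N, &one, descrU, matrixLU, iRow, jCol, inforU, temp, z);
```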
So, does anybody have a clue why the analysis phase is so slow, and how it might be accelerated? Is the (analysis phase execution time)/(solve phase execution time) ratio GPU-dependent?
Edit: I tried this solver on many cases, and the results were similar, but the specific case I'm concerned about has the following conditions:
- Size (N): ~860,000 × 860,000
- Number of non zeros (NZ): ~6,000,000
- Number of iterations required for convergence: 10
- Analysis phase execution time: 210 ms
- Solve phase execution time (summing up all iterations): 350 ms
- All floating point operations are performed in double precision format
- GPU: GeForce GTX 550 Ti
- OS: Windows 7 ultimate, 64 bit