I wrote a C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
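For context, the diagonalization step looks roughly like this (a minimal sketch, assuming LAPACKE_dsyev with column-major storage; the helper name diagonalize is just for illustration). On exit, A is overwritten with the eigenvectors, which is the V used below, and D holds the eigenvalues:

#include <lapacke.h>

// Diagonalize the symmetric matrix A = V.D.V^T in place:
// on exit, A holds the eigenvectors column-wise (the V below) and D the eigenvalues.
// Sketch only; error handling omitted.
int diagonalize(double* A, double* D, const int& N)
{
    return LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', N, A, N, D);
}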
Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.
Here is my naive implementation:
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N; j++)
        {
            for (int k=0; k<N; k++)
            {
                // X[i] = sum over j,k of V(i,k) * V(j,k) * B[j] / D[k]  (V stored column-major)
                X[i] += B[j] * V[i+k*N] * V[j+k*N] / D[k];
            }
        }
    }
}
All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
For now, this naive implementation is very slow and is the bottleneck of the code. Any hints and help would be much appreciated!
Thanks
EDIT Thanks to Jérôme Richard's answer and comment, I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this solution is ~4 times faster than his proposal, which itself was 1000 times faster than my naive implementation. (This makes sense: the naive triple loop does O(N^3) work, whereas the factored form V.(D^-1.(V^T.B)) only needs two matrix-vector products and an element-wise scaling, i.e. O(N^2).) I found the #pragma omp parallel for simd clause to be faster than other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.
// Solve A.X = B using the eigendecomposition A = V.D.V^T, i.e. X = V.(D^-1.(V^T.B))
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    double* sum = new double[N]{0.};
    // sum = V^T.B
    cblas_dgemv(CblasColMajor, CblasTrans, N, N, 1., V, N, B, 1, 0., sum, 1);
    // sum = D^-1.sum (element-wise division by the eigenvalues)
    #pragma omp parallel for simd
    for (int i = 0; i < N; ++i)
    {
        sum[i] /= D[i];
    }
    // X = V.sum
    cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1., V, N, sum, 1, 0., X, 1);
    delete[] sum;
}
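For completeness, here is roughly how the whole thing fits together (a sketch only; the symmetric test matrix is made up for this example, and my real code obviously builds A differently):

#include <cblas.h>
#include <lapacke.h>
#include <vector>
#include <cstdio>

int main()
{
    const int N = 1000;
    std::vector<double> A(N*N), D(N), B(N, 1.), X(N, 0.);

    // Symmetric, diagonally dominant test matrix (illustrative only)
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            A[i + j*N] = 1. / (1. + i + j) + (i == j ? N : 0.);

    // A is overwritten with the eigenvectors, D receives the eigenvalues
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', N, A.data(), N, D.data());

    // Solve A.X = B with the routine above (A.data() now plays the role of V)
    solve(A.data(), D.data(), B.data(), X.data(), N);

    std::printf("X[0] = %g\n", X[0]);
    return 0;
}

This needs to be linked against a LAPACKE/CBLAS implementation (e.g. OpenBLAS) and compiled with OpenMP enabled (e.g. -fopenmp with g++).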