
I wrote a C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
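
For context, the diagonalization step looks roughly like the sketch below (I am assuming the LAPACKE_dsyev driver and column-major storage here; the exact routine does not matter for the question):

#include <lapacke.h>

// Sketch only: diagonalize the symmetric N x N matrix A (column-major).
// On success, A is overwritten with the eigenvectors V (one per column)
// and w receives the eigenvalues D in ascending order.
int diagonalize(double* A, double* w, const int N)
{
    return LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', N, A, N, w);
}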

Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.

Here is my naive implementation:

// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N; j++)
        {
            for (int k=0; k<N; k++)
            {
                X[i] += B[j] * V[i+k*N] * V[j+k*N] / D[k];
            }
        }
    }
}

All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
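
To be explicit about the layout (matching the CblasColMajor calls further down): V is stored column-major, so element (i, k) is V[i + k*N] and each eigenvector is one contiguous column of length N. Just as an illustration:

// Illustration only: column-major access into V (N x N).
// Column k (the k-th eigenvector) is contiguous in memory.
inline double get(const double* V, int i, int k, int N)
{
    return V[i + k*N];
}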

For now, this naive implementation is very slow and is the bottleneck of the code. Any hints or help would be much appreciated!

Thanks

EDIT Thanks to Jérôme Richard's answer and comment, I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this solution is ~4 times faster than his proposal, which itself was 1000 times faster than my naive implementation.

I found the #pragma omp parallel for simd clause to be faster than the other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.

#include <cblas.h>

void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{

    double* sum = new double[N]{0.};

    cblas_dgemv(CblasColMajor,CblasTrans,N,N,1.,V,N,B,1,0.,sum,1);

    #pragma omp parallel for simd
    for (int i=0; i<N; ++i)
    {
        sum[i] /= D[i];
    }

    cblas_dgemv(CblasColMajor,CblasNoTrans,N,N,1.,V,N,sum,1,0.,X,1);

    delete [] sum;
}
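
Not strictly part of the question, but for anyone testing this: a quick sanity check is to compute the residual b - A.x with cblas_dsymv, assuming a copy of the original symmetric A (full storage, column-major) is still available:

#include <cblas.h>

// Sketch only: returns ||B - A.X||_2. R is a caller-provided buffer of size N.
double residual_norm(const double* A, const double* B, const double* X, double* R, const int N)
{
    cblas_dcopy(N, B, 1, R, 1);                                             // R = B
    cblas_dsymv(CblasColMajor, CblasUpper, N, -1., A, N, X, 1, 1., R, 1);   // R = R - A.X
    return cblas_dnrm2(N, R, 1);
}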

1 Answer


This code is currently highly memory-bound, so the resulting program will probably scale poorly (as long as compiler optimizations are enabled). Indeed, on most common systems (e.g. a 1-socket non-NUMA processor), the RAM throughput is a resource shared between cores, and a scarce one. Moreover, the memory access pattern is inefficient, and the algorithmic complexity of the code can be improved.

To speed up the computation, the j and k loops can be swapped so that V is read contiguously. Moreover, the factor V[i+k*N] / D[k] then becomes a constant in the innermost loop. The computation can be factorized further, since B[j] and V[j+k*N] do not depend on i either: X[i] = sum_k (V[i+k*N] / D[k]) * (sum_j B[j] * V[j+k*N]). Thanks to precomputing the inner sums, the resulting algorithm runs in O(N^2) rather than O(N^3)!

Finally, omp simd can be used to help compilers vectorize the code, making it even faster!

Note that the #ifdef _OPENMP guard seems useless here, since #pragma omp directives are simply ignored by compilers when OpenMP is disabled or not supported.

#include <vector>

// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    std::vector<double> kSum(N);

    #pragma omp parallel for
    for (int k=0; k<N; k++)
    {
        double sum = 0.0;

        #pragma omp simd reduction(+:sum)
        for (int j=0; j<N; j++)
        {
            sum += B[j] * V[j+k*N];
        }

        kSum[k] = sum / D[k];
    }

    // Loop tiling can be used to speed up this section even more.
    // The idea is to swap the i-based and k-based loops and work on thread-private
    // copies of X, then sum the thread-private versions into the global X
    // (see the sketch after this function).
    // The resulting code would then work on contiguous data and can even be vectorized.
    #pragma omp parallel for
    for (int i=0; i<N; i++)
    {
        double sum = X[i];

        for (int k=0; k<N; k++)
        {
            sum += kSum[k] * V[i+k*N];
        }

        X[i] = sum;
    }
}
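
Here is a rough sketch of that loop-swap idea (illustration only, not a tuned implementation): the k loop becomes the outer one so that each column of V is read contiguously, and each thread accumulates into a private copy of X that is merged at the end.

#include <vector>

// Sketch only: alternative second pass with swapped loops and thread-private accumulators.
void second_pass(const double* V, const double* kSum, double* X, const int N)
{
    #pragma omp parallel
    {
        std::vector<double> localX(N, 0.0);   // thread-private accumulator

        #pragma omp for nowait
        for (int k=0; k<N; k++)
        {
            const double c = kSum[k];

            #pragma omp simd
            for (int i=0; i<N; i++)
            {
                localX[i] += c * V[i+k*N];    // contiguous reads of column k
            }
        }

        // Merge the thread-private accumulators into the shared X.
        #pragma omp critical
        for (int i=0; i<N; i++)
        {
            X[i] += localX[i];
        }
    }
}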

The new code should be several orders of magnitude faster than the original one (but still memory-bound). Note that the results might be slightly different (since floating-point operations are not really associative), but I expect them to be more accurate.