4
votes

In my project I use the Eigen 3.3 library to do calculations with 6x6 matrices. I decided to investigate whether AVX instructions really give me any speedup over SSE. My CPU supports both sets:

model name      : Intel(R) Xeon(R) CPU E5-1607 v2 @ 3.00GHz
flags           : ...  sse sse2 ... ssse3 ... sse4_1 sse4_2 ... avx ...

So, I compile the small test shown below with GCC 4.8 using two different sets of flags:

$ g++ test-eigen.cxx -o test-eigen -march=native -O2 -mavx
$ g++ test-eigen.cxx -o test-eigen -march=native -O2 -mno-avx

I confirmed that the second case with -mno-avx does not produce any instructions using ymm registers. Nevertheless, the two builds give me very similar results of about 520 ms as measured with perf.

Here is the program test-eigen.cxx (it does an inverse of the sum of two matrices just to be close to the actual task I am working on):

#define NDEBUG

#include <iostream>
#include "Eigen/Dense"

using namespace Eigen;

int main()
{
   typedef Matrix<float, 6, 6> MyMatrix_t;

   MyMatrix_t A = MyMatrix_t::Random();
   MyMatrix_t B = MyMatrix_t::Random();
   MyMatrix_t C = MyMatrix_t::Zero();
   MyMatrix_t D = MyMatrix_t::Zero();
   MyMatrix_t E = MyMatrix_t::Constant(0.001);

   // Make A and B symmetric positive definite matrices
   A.diagonal() = A.diagonal().cwiseAbs();
   A.noalias() = MyMatrix_t(A.triangularView<Lower>()) * MyMatrix_t(A.triangularView<Lower>()).transpose();

   B.diagonal() = B.diagonal().cwiseAbs();
   B.noalias() = MyMatrix_t(B.triangularView<Lower>()) * MyMatrix_t(B.triangularView<Lower>()).transpose();

   for (int i = 0; i < 1000000; i++)
   {
      // Calculate C = (A + B)^-1
      C = (A + B).llt().solve(MyMatrix_t::Identity());

      D += C;

      // Somehow modify A and B so they remain symmetric
      A += B;
      B += E;
   }

   std::cout << D << "\n";

   return 0;
}

Should I really expect better performance with AVX in Eigen? Or am I missing something in the compiler flags or in the eigen configuration? It is possible that my test is not suitable to demonstrate the difference but I don't see what might be wrong with it.

1
I was not aware that Eigen supported AVX yet. In fact it even supports AVX512 now. That's good to know: eigen.tuxfamily.org/index.php?title=3.3#Vectorization – Z boson

1 Answer

5
votes

You are using matrices that are too small to make use of AVX: with single precision, AVX works on packets of 8 scalars at once. With 6x6 matrices, AVX can only be leveraged for pure component-wise operations such as A = B + C, because these can be seen as operations on 1D vectors of size 36, which is larger than 8. In your case, such operations are negligible compared to the cost of the Cholesky factorization and solve.

To see the difference, move to MatrixXf matrices of size 100x100 or larger.