5
votes

Would it be possible to implement a class that receives a C-style pointer as a template argument and somehow resolves into a static Eigen matrix but using the memory provided? Say a declaration would look something like:

EIGEN_ALIGN16 double array[9];
CMatrix<double,3,3,array> :: m;

I do know about maps, but the example code I provide below has proven them to be slower by 20% when compared to static Eigen matrices.

These would be the premises:

  • I need to provide my own C pointer. This way I can efficiently reuse C code without incurring copies.
  • The resulting matrix should look static to Eigen so that Eigen can optimize as it would with a static array at compile time. Look at my example above where at compile time I would provide both matrix size (static) and the C pointer.
  • CMatrix should fall back to Eigen::Matrix. When the additional template parameter for the C array is not provided I would get the normal Eigen matrix.
  • I do not intend to make a full Eigen extension. With that I mean is I do not care about all kind of checks to provide a neat extension for other users. I just want the most efficient solution for this problem

Would it be possible to implement a solution by adding a new constructor? Say something like:

EIGEN_ALIGN16 double data[9];
Eigen::Matrix<double,3,3> m(data); //where data is NOT copied but used to replace the static allocation used by default.

Find below my code for benchmarking Map vs. Matrix efficiency. It is self contained and you can compile with:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I/path_to_my_Eigen benchmark1.cpp -o benchmark1 -lrt

Here is the code:

#include <Eigen/Eigen>
#include <bench/BenchTimer.h>

#include <iostream>

using namespace Eigen;
using namespace std;

//#define CLASSIC_METHOD
#define USE_MAPS

EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4])
{
  for (int ii=0; ii<4; ii++)
    {
      VO[ii] = AT[ii][0] * VI[0] +
               AT[ii][1] * VI[1] +
               AT[ii][2] * VI[2] +
               AT[ii][3] * VI[3];
    }
};

template <typename OutputType, typename MatrixType, typename VectorType>
EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE)
{
  VOE.noalias() = A44.transpose()*VIE;
};

int main()
{
  EIGEN_ALIGN16 double AT[4][4] = {0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0};
  EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4};
  EIGEN_ALIGN16 double VO[4];

//Eigen matrices

#ifndef USE_MAPS
  Matrix4d A44 = Matrix4d::MapAligned(AT[0]);
      Vector4d VIE = Vector4d::MapAligned(VI);
  Vector4d VOE(0,0,0,0);
#else
  Map<Matrix4d,Aligned> A44(AT[0]);
  Map<Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);

  // Map<Matrix4d> A44(AT[0]);                                                                                                                                                                                                                                      
  // Map<Vector4d> VIE(VI);                                                                                                                                                                                                                                           
  // Map<Vector4d> VOE(VO);

#endif

#ifdef EIGEN_VECTORIZE
  cout << "EIGEN_VECTORIZE defined" << endl;
#else
    cout << "EIGEN_VECTORIZE NOT defined" << endl;
#endif

  cout << "VIE:" << endl;
  cout << VIE << endl;

  VI[0] = 3.14;
  cout << "VIE:" << endl;
  cout << VIE << endl;

  BenchTimer timer;

  const int num_tries = 5;
  const int num_repetitions = 200000000;

#ifdef CLASSIC_METHOD
  BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI));
  std::cout << Vector4d::MapAligned(VO) << std::endl;
#else
  BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE));
  std::cout << VOE << std::endl;
#endif

  double elapsed = timer.best();
  std::cout << "elapsed time: " << elapsed*1000.0 << " ms" << std::endl;

  return 0;
}
1
Why would you think it is not possible?gha.st
hint: start with using std::size_t; template <typename T, size_t Rows, size_t Cols, T (&Var)[Rows][Cols]> struct MyEfficientVector { using matrix_type = decltype(Var); matrix_type m = Var; } ; and run from there :DMassa
@Massa. Thank you. Would I need then to Inherit from Matrix/MatrixBase? how can I make this struct in the end to be an Eigen::Matrix? (I mean, I still want to have all the efficient operations, etc. that Eigen offers)Alejandro
@gha.st. What would you suggest to implement this then?Alejandro
@Alejandro that is a very good question. I don't know if it's even possible, given that Eigen::MatrixBase has its data in a Eigen::DenseStorage. Maybe you should start by giving a look at Eigen/DenseStorage.h and understand what you should do to replace its default storage schema for one of your design. In any case, your plan to do something "quick and dirty" is botched, because if you want to extend Eigen's functionalities, you'll certainly have to make a "proper" Eigen extension...Massa

1 Answers

5
votes

Rather off-topic but since you stressed performance:

Eigen assembly isn't always optimal - there is some overhead due to poor register reuse and writing back to memory (this is not to blame Eigen by any means - doing this in generic templates is an impossible task).

If your kernels are fairly simple (QCD?), I would write assembly by hand (using intrinsics).

Here is your classical kernel rewritten in intrinsics, faster than Eigen version and same for Map/Matrix types (so you dont have to invent your own types).

EIGEN_DONT_INLINE void classic(double * __restrict__ VO, const double * __restrict__ AT, const double * __restrict__ VI) {
  __m128d vi01 = _mm_load_pd(VI+0);
  __m128d vi23 = _mm_load_pd(VI+2);
  for (int i = 0; i < 4; i += 2) {
    __m128d v00, v11;
    // v[i+0,i+0]                                                                                                                                                                                                   
    {
      int ii = i*4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v00 = _mm_mul_pd(at01, vi01);
      v00 = _mm_add_pd(v00, _mm_mul_pd(at23, vi23));
    }
    // v[i+1,i+1]                                                                                                                                                                                                   
    {
      int ii = i*4 + 4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v11 = _mm_mul_pd(at01, vi01);
      v11 = _mm_add_pd(v11, _mm_mul_pd(at23, vi23));
    }

    __m128d v = _mm_hadd_pd(v00, v11);
    // v is now [v00[0] + v00[1], v11[0] + v11[1]]                                                                                                                                                                               
    _mm_store_pd(VO+i, v);
    // VO[i] += AT[j+0 + i*4]*VI[j+0];                                                                                                                                                                              
    // VO[i] += AT[j+1 + i*4]*VI[j+1];                                                                                                                                                                              
  }
}

You may gain some additional improvement by interleaving loads and mul/adds - I tried to keep it simple.

These are the results:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 611.397 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 615.541 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 981.941 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 838.852 ms

On further note, you could possibly write a better simd kernel if you matrix was transposed - horizontal adds (_mm_hadd_pd) are expensive.

To add to discussion in comments: moving Eigen maps inside the function removes difference in time between map and matrix arguments.

EIGEN_DONT_INLINE void mapped(double (&VO)[4], const double (&AT)[4][4], const double (&VI)[4]) {
  Map<const Matrix4d,Aligned> A44(&AT[0][0]);
  Map<const Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);
  VOE.noalias() = A44.transpose()*VIE;
}

This is top of the assembly when passing Map to function (function not inlined)

    movq    (%rsi), %rcx
    movq    (%rdx), %rax
    movq    (%rdi), %rdx
    movapd  (%rcx), %xmm0
    movapd  16(%rcx), %xmm1
    mulpd   (%rax), %xmm0
    mulpd   16(%rax), %xmm1

compared to passing array reference (and map inside) or matrix

    movapd  (%rsi), %xmm0
    movapd  16(%rsi), %xmm1
    mulpd   (%rdx), %xmm0
    mulpd   16(%rdx), %xmm1