3 votes

I was trying to do linear-algebra numerical computation in C++. I used Python NumPy for quick prototyping, and I would like to find a C++ linear-algebra package for a further speedup. Eigen seems like a good starting point.

I wrote a small performance test using a large dense matrix multiplication to compare processing speed. In NumPy I was doing this:

import numpy as np
import time

a = np.random.uniform(size=(5000, 5000))
b = np.random.uniform(size=(5000, 5000))
start = time.time()
c = np.dot(a, b)
print((time.time() - start) * 1000, 'ms')

In C++ Eigen I was doing this:

#include <time.h>
#include <iostream>
#include "Eigen/Dense"

using namespace std;
using namespace Eigen;

int main() {
    MatrixXf a = MatrixXf::Random(5000, 5000);
    MatrixXf b = MatrixXf::Random(5000, 5000);
    time_t start = clock();
    MatrixXf c = a * b;
    cout << (double)(clock() - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;
    return 0;
}

I have done some searching in the documentation and on Stack Overflow regarding compiler optimization flags. I compiled the program with this command:

g++ -g test.cpp -o test -Ofast -msse2

The C++ executable compiled with the -Ofast optimization flag runs about 30x or more faster than an unoptimized build. It returns the result in roughly 10000 ms on my 2015 MacBook Pro.

Meanwhile, NumPy returns the result in about 1800 ms.
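(For context, `np.dot` does not run the triple loop in Python; it dispatches to whatever BLAS library NumPy was built against — typically a multithreaded one such as Accelerate on a stock OS X install, or MKL/OpenBLAS elsewhere. A quick way to check which backend your NumPy build uses, plus a sanity check that `dot` is a true matrix product:)

```python
import numpy as np

# Show which BLAS/LAPACK libraries this NumPy build is linked against.
np.__config__.show()

# Sanity check: np.dot performs a true matrix product.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])
print(np.dot(a, b))  # [[19. 22.] [43. 50.]]
```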

I was expecting a performance boost from Eigen compared with NumPy, but this fell short of my expectation.

Are there any compiler flags I missed that would further boost Eigen's performance here? Or is there a multithreading switch that can be turned on to give me an extra performance gain? I am just curious about this.

Thank you very much!

Edit on April 17, 2016:

After some searching based on @ggael's answer, I have come up with the answer to this question.

The best solution is to compile with Intel MKL linked as the backend for Eigen. For an OS X system the library can be found here. With MKL installed, I used the Intel MKL Link Line Advisor to enable MKL backend support for Eigen.

I compile like this to enable everything MKL offers:

g++ -DEIGEN_USE_MKL_ALL -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl -m64 -I${MKLROOT}/include -I. -Ofast -DNDEBUG test.cpp -o test

If there is an environment-variable error for MKLROOT, just run the environment setup script provided with the MKL package, which on my device is installed by default at /opt/intel/mkl/bin.

With MKL as the Eigen backend, the multiplication of two 5000x5000 matrices finishes in about 900 ms on my 2.5 GHz MacBook Pro. This is much faster than Python NumPy on my device.

Are you sure you are running the above test cases? For 500x500 matrices I get benchmarks of C++: 20 ms, Python/NumPy: 310 ms; for 5000x5000 matrices C++ is also an order of magnitude faster (with -Ofast). – Charles Pehlivanian

@CharlesPehlivanian I am using Python NumPy to compute the 500x500 case, and that gives me a 3 ms running time where Eigen takes about 10 ms. I still cannot get faster than NumPy. – yc2986

2 Answers

3 votes

To answer on the OS X side: first of all, recall that on OS X g++ is actually an alias for clang++, and Apple's current version of clang does not support OpenMP. Nonetheless, using Eigen 3.3-beta-1 and the default clang++, I get on a 2.6 GHz MacBook Pro:

$ clang++ -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG  &&  ./a.out
2954.91ms

Then, to get multithreading support, you need a recent clang or gcc compiler, for instance from Homebrew or MacPorts. Here, using gcc 5 from MacPorts, I get:

$ g++-mp-5 -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG -fopenmp -Wa,-q && ./a.out
804.939ms

and with clang 3.9:

$ clang++-mp-3.9 -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG -fopenmp  && ./a.out
806.16ms

Remark that gcc on OS X does not know how to properly assemble AVX/FMA instructions, so you need to tell it to use the native assembler with the -Wa,-q flag.

Finally, with the devel branch you can also tell Eigen to use any BLAS as a backend, for instance the one from Apple's Accelerate framework, as follows:

$ g++ -framework Accelerate -DEIGEN_USE_BLAS -O3 -DNDEBUG so_gemm_perf.cpp  -I ../eigen  && ./a.out
802.837ms
0 votes

Compiling your little program with VC2013:

  • /fp:precise - 10.5s
  • /fp:strict - 10.4s
  • /fp:fast - 10.3s
  • /fp:fast /arch:AVX2 - 6.6s
  • /fp:fast /arch:AVX2 /openmp - 2.7s

So using AVX/AVX2 and enabling OpenMP helps a lot. You can also try linking against MKL (http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).