
I use the following code to test Eigen performance.

#include <iostream>
#include <chrono>
#define EIGEN_NO_DEBUG
#include <eigen3/Eigen/Dense>
#include <cblas.h>
using namespace std;
using namespace std::chrono;

int main()
{
    int n = 3000;

    high_resolution_clock::time_point t1, t2;

    Eigen::MatrixXd A(n, n), B(n, n), C(n, n);
    A.setRandom();  // initialize inputs; multiplying uninitialized memory is meaningless
    B.setRandom();

    t1 = high_resolution_clock::now();
    C = A * B;
    t2 = high_resolution_clock::now();
    auto dur = duration_cast<milliseconds>(t2 - t1);
    cout << "eigen: " << dur.count() << endl;

    t1 = high_resolution_clock::now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 1.0, C.data(), n);
    t2 = high_resolution_clock::now();
    dur = duration_cast<milliseconds>(t2 - t1);
    cout << "cblas: " << dur.count() << endl;

    return 0;
}

I compile it with the following command:

g++ test.cpp  -O3 -fopenmp -lblas -std=c++11 -o test

The results are:

eigen: 1422 ms
cblas: 432 ms

Am I doing something wrong? According to their benchmarks, Eigen should be faster.

Another problem is that with numpy I get 24 ms:

import time 
import numpy as np

a = np.random.random((3000, 3000))
b = np.random.random((3000, 3000))
start = time.time()
c = a * b
print("time: ", time.time() - start)
Comments:

- What version of Eigen and g++ are you using? - Avi Ginsburg
- With numpy arrays, * is element-wise multiplication. Change c = a * b to c = a.dot(b). Or, if you are using a sufficiently new version of Python 3 and numpy, you can write c = a @ b. - Warren Weckesser
- @Avi Ginsburg: Eigen version 3.2, g++ version 4.9.2. The problem was in using Eigen 3.2. - John X.
- @Warren Weckesser: sorry about a*b, I decided to test it just before asking the question and forgot about a.dot(b). - John X.

1 Answer


Saying that you are using cblas provides very little information, because CBLAS is just an API. The underlying BLAS library could be netlib's reference BLAS, OpenBLAS, ATLAS, Intel MKL, Apple's Accelerate, or even Eigen's own BLAS. Given your measurements, it is pretty clear that your underlying BLAS is a highly optimized one that exploits AVX, FMA, and multi-threading. For a fair comparison, you must enable those features on Eigen's side as well: compile with -march=native -fopenmp and make sure you are using Eigen 3.3. The performance should then be about the same.
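With those suggestions applied, the compile command from the question would become something like this (a sketch; the exact flags that -march=native enables depend on your CPU):

```shell
# -march=native enables CPU-specific SIMD (e.g. AVX, FMA);
# -fopenmp lets Eigen use multiple threads for the matrix product
g++ test.cpp -O3 -march=native -fopenmp -lblas -std=c++11 -o test
```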

Regarding numpy, Warren Weckesser already identified the issue in the comments: a * b is element-wise multiplication, not a matrix product. You could also have figured this out yourself: performing the 2*3000^3 = 54e9 floating-point operations of a matrix product in 24 ms is impossible on a standard computer.
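The element-wise vs. matrix-product distinction is easy to check directly in numpy (a minimal sketch; the 200x200 size is arbitrary):

```python
import numpy as np

a = np.random.random((200, 200))
b = np.random.random((200, 200))

elementwise = a * b   # what the 24 ms benchmark actually timed
matmul = a @ b        # a true matrix product, equivalent to a.dot(b)

# @ and .dot() agree; the element-wise product is something else entirely
assert np.allclose(matmul, a.dot(b))
assert not np.allclose(elementwise, matmul)
```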