Context: I am using Eigen for an artificial neural network where the typical dimensions are around 1000 nodes per layer. So most operations multiply a matrix M of size ~(1000,1000) by either a vector of size 1000 or a batch of B vectors, represented as a matrix of size Bx1000.
After training a neural network, I apply pruning, a common compression technique, which ends up with a sparse matrix (density of non-zero parameters between 10% and 50%).
Goal: I would like to use sparse matrices for compression purposes and secondarily for performance optimization, but that is not the main goal.
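For reference, a minimal sketch of the compression step, assuming pruning has already set the dropped weights to exactly zero (the helper name compress is mine, not part of the benchmark below):

#include <Eigen/Dense>
#include <Eigen/Sparse>
using namespace Eigen;

// sparseView() keeps only the non-zero coefficients of the pruned dense
// matrix; makeCompressed() switches to the compact storage used for products.
SparseMatrix<float> compress(const MatrixXf &weights) {
  SparseMatrix<float> s = weights.sparseView();
  s.makeCompressed();
  return s;
}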
Issue: I am comparing the performance of sparse and dense matrix multiplication (only the multiplication time is measured) for different batch sizes, and I am observing the following (using Eigen 3.2.8, 64-bit MacBook Pro, without OpenMP, compiled with standard g++):
- when B=1 (matrix x vector), sparse matrix operations with density 10% or 30% are more efficient than dense matrix operations - which seems like the expected result: far fewer operations are performed;
- for B=32:
- the time needed for the dense matrix operation is only ~10 times the time needed for B=1 - which is cool - does this show some vectorization effect?
- the time needed for the sparse matrix operation is 67 times the time needed for B=1 - which means it is less efficient than processing the 32 vectors independently.
[Plot: MxN multiplication time (ms) for M sparse/dense, with N of size 1000xB]
Code: I am using the following types for sparse and dense matrices:
typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;
The operation I am benchmarking is the following:
o.noalias()=m*in.transpose();
where o is a dense matrix (1000xB), m is either a dense matrix (1000x1000) or the corresponding sparse matrix obtained with m.sparseView(), and in is a dense matrix (Bx1000).
The full code is below (averaging the time over 20 different random matrices, and running each multiplication 50 times) - the times for B=32 and B=1 are below.
Any feedback/intuition is welcome!
batch 1 ratio 0.3 dense 0.32 sparse 0.29
batch 32 ratio 0.3 dense 2.75 sparse 15.01
#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <iostream>
#include <stdlib.h>
#include <boost/timer/timer.hpp>

using namespace Eigen;
using namespace boost::timer;

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;

void bench_Sparse(const spMatFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias() = m * in.transpose();
}

void bench_Dense(const deMatRowFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias() = m * in.transpose();
}

int main(int argc, const char **argv) {
  float ratio = 0.3;  // target density of the sparse matrix
  int iter = 20;      // number of random matrices to average over
  int batch = 32;
  float t_dense = 0;
  float t_sparse = 0;
  deMatRowFloat d_o1(batch, 1000);
  deMatRowFloat d_o2(batch, 1000);
  for (int k = 0; k < iter; k++) {
    deMatRowFloat d_m = deMatRowFloat::Zero(1000, 1000);
    deMatRowFloat d_b = deMatRowFloat::Random(batch, 1000);
    // fill ~ratio*1e6 entries at random; duplicate (i,j) draws make the
    // actual density slightly lower than ratio
    for (int h = 0; h < ratio * 1000000; h++) {
      int i = rand() % 1000;
      int j = rand() % 1000;
      d_m(i, j) = (rand() % 1000) / 500. - 1;
    }
    spMatFloat s_m = d_m.sparseView();
    {
      cpu_timer timer;
      for (int n = 0; n < 50; n++) bench_Dense(d_m, d_b, d_o1);  // n, not k: avoid shadowing
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system + elapsed_times.user);
      t_dense += elapsed / 1000000.;
    }
    {
      cpu_timer timer;
      for (int n = 0; n < 50; n++) bench_Sparse(s_m, d_b, d_o2);
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system + elapsed_times.user);
      t_sparse += elapsed / 1000000.;
    }
  }
  std::cout << "batch\t" << batch << "\tratio\t" << ratio
            << "\tdense\t" << t_dense / 50 / iter
            << "\tsparse\t" << t_sparse / 50 / iter << std::endl;
}
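Note: I compile with optimization enabled, since Eigen is very slow in debug builds; something like g++ -O3 bench.cpp -lboost_timer -lboost_system (the exact Boost libraries to link may vary with your installation).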
New results after ggael's suggestion: I tried the different possible combinations and indeed found huge differences in performance when switching M and B between RowMajor and ColMajor.
To summarize, I am interested in computing M*B where M is (1000,1000) and B is (1000,batch), and in comparing the performance of M sparse vs. dense as the batch size grows.
I tested 3 configurations:
- M dense, B dense
- M sparse, B dense
- M sparse, B dense, but the multiplication M*B is done manually, column by column (see the sketch after this list)
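A minimal sketch of that third, column-by-column variant (the function name is mine; it assumes B is column-major, Eigen's default for MatrixXf):

// Configuration 3: multiply M*B one column at a time, so each step is a
// sparse matrix-vector product, i.e. the B=1 case that performed well above.
void bench_Sparse_by_col(const spMatFloat &m, const MatrixXf &in, MatrixXf &o) {
  for (int c = 0; c < in.cols(); ++c)
    o.col(c).noalias() = m * in.col(c);
}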
The results are as follows, where the number of interest is the ratio of the time per column for B=32 to the time for B=1, with matrix M at density 0.3:
The initially reported problem was the worst case (M ColMajor, B RowMajor). For (M RowMajor, B ColMajor), there is a 5x speedup between B=32 and B=1, and the performance of the sparse matrix is almost equivalent to that of the dense matrix.
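For completeness, a sketch of that fastest configuration, under the same 1000x1000 setup (the typedef spMatRowFloat is mine, not from the original code):

// Row-major sparse M times column-major dense B: each sparse row is
// traversed contiguously against a contiguous dense column.
typedef SparseMatrix<float, RowMajor> spMatRowFloat;

void bench_Sparse_fast(const spMatRowFloat &m, const MatrixXf &in, MatrixXf &o) {
  // in is (1000, batch) and ColMajor, so no transpose is needed here
  o.noalias() = m * in;
}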