25
votes

I have two int matrices A and B, each with more than 1000 rows and 10K columns, and I often need to convert them to float matrices to gain a speedup (4x or more).

I'm wondering why this is the case. I realize that there is a lot of optimization and vectorization, such as AVX, going on with float matrix multiplication. Yet there are also instructions for integers, such as AVX2 (if I'm not mistaken). And can't one make use of SSE and AVX for integers?

Why isn't there a heuristic underneath matrix algebra libraries such as NumPy or Eigen to capture this and perform integer matrix multiplication as quickly as float?
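For context, here is a rough sketch of the kind of timing comparison I mean (assuming Eigen 3 and an optimized build such as -O3 -march=native; the sizes, the value clamping, and the time_product helper are just illustrative, not a rigorous benchmark):

// Rough timing sketch: integer vs. float matrix product with Eigen.
#include <Eigen/Core>
#include <chrono>
#include <iostream>

template <typename Matrix>
double time_product(const Matrix& A, const Matrix& B)
{
    auto start = std::chrono::steady_clock::now();
    Matrix C = A * B;
    auto stop = std::chrono::steady_clock::now();
    volatile auto sink = C(0, 0);   // keep the product from being optimized away
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    const int rows = 1000, cols = 10000;
    // Keep values small so the integer products and sums cannot overflow.
    Eigen::MatrixXi Ai = Eigen::MatrixXi::Random(rows, cols)
                             .unaryExpr([](int x) { return x % 100; });
    Eigen::MatrixXi Bi = Eigen::MatrixXi::Random(cols, rows)
                             .unaryExpr([](int x) { return x % 100; });
    Eigen::MatrixXf Af = Ai.cast<float>();
    Eigen::MatrixXf Bf = Bi.cast<float>();

    std::cout << "int:   " << time_product(Ai, Bi) << " s\n";
    std::cout << "float: " << time_product(Af, Bf) << " s\n";
}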

About the accepted answer: while @sascha's answer is very informative and relevant, @chatz's answer is the actual reason why int-by-int multiplication is slow, irrespective of whether BLAS integer matrix operations exist.

2
It would help to make the question more specific, but since more people need it for float, more effort was made to optimize it for float (in both software and hardware). – Marc Glisse
This question needs specific example code to demonstrate the performance difference (see minimal reproducible example). Particularly given that the code is tagged [c++] and [numpy], it is completely unclear what you are referring to. – Zulan

2 Answers

14
votes

If you compile these two simple functions, which essentially just calculate a product (using the Eigen library),

#include <Eigen/Core>

// Integer product; returning one coefficient keeps the computation from being optimized away.
int mult_int(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0, 0);
}

// Float product; identical structure, only the scalar type differs.
int mult_float(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0, 0);
}

using the flags -mavx2 -S -O3, you will see very similar assembly code for the integer and the float version. The main difference, however, is that vpmulld has 2-3 times the latency and only 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).

Reference: Intel Intrinsics Guide. "Throughput" here means the reciprocal throughput, i.e., how many clock cycles are used per operation when no latency stalls occur (somewhat simplified).
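As a rough illustration of that instruction-level difference (a hand-written sketch, not how Eigen's actual kernels are written), the packed integer multiply maps to _mm256_mullo_epi32 (vpmulld) while the packed float multiply maps to _mm256_mul_ps (vmulps):

// Sketch only: multiply 8 ints vs. 8 floats with AVX2/AVX intrinsics.
// Compile with e.g. -mavx2 -O3; this shows which instructions are involved,
// it does not reproduce Eigen's actual GEMM kernels.
#include <immintrin.h>

// Element-wise product of 8 ints; the multiply compiles to vpmulld.
void mul8_int(int* out, const int* a, const int* b)
{
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    __m256i vc = _mm256_mullo_epi32(va, vb);   // vpmulld: higher latency, lower throughput
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), vc);
}

// Element-wise product of 8 floats; the multiply compiles to vmulps.
void mul8_float(float* out, const float* a, const float* b)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_mul_ps(va, vb);         // vmulps: lower latency, higher throughput
    _mm256_storeu_ps(out, vc);
}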

14
votes

All those vector-vector, matrix-vector, and matrix-matrix operations use BLAS internally. BLAS, which has been optimized over decades for different architectures, CPUs, instruction sets, and cache sizes, has no integer type!

Here is a branch of OpenBLAS working on it (and a small discussion on Google Groups linking to it).

And I think I heard that Intel's MKL (Intel's BLAS implementation) might be working on integer types too. This talk (mentioned in that forum) looks interesting, although it's short and probably more about the small integral types that are useful in embedded deep learning.
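Until such support exists, the usual workaround is the one the question already describes: cast to a floating-point type, let the float/double kernels do the product, and cast back. A minimal Eigen sketch (the function name is mine), which is only exact while every intermediate value fits in the floating-point mantissa:

// Sketch: route an integer product through the faster floating-point path.
// Exact only while all products and partial sums fit in the mantissa
// (24 bits for float, 53 bits for double).
#include <Eigen/Core>

Eigen::MatrixXi mult_via_double(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    return (A.cast<double>() * B.cast<double>()).cast<int>();
}

Whether float or double is the better intermediate type depends on your value range: double halves the SIMD width but tolerates much larger intermediate values before losing exactness.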