How does BLAS incorporate matrix chain multiplication optimisation

Question

BLAS (basic linear algebra subprograms) provide many other programming languages, like Matlab, which I use, with fast routines to do things like matrix multiplication.

However, when multiplying multiple matrices together, there is an optimal order to "bracket" the matrices. Taken from the wikipedia article:

For example, suppose A is a 10 × 30 matrix, B is a 30 × 5 matrix, and C is a 5 × 60 matrix. Then,

(AB)C = (10×30×5) + (10×5×60) = 1500 + 3000 = 4500 operations

A(BC) = (30×5×60) + (10×30×60) = 9000 + 18000 = 27000 operations.

The article goes on to discuss ways you solve for the optimal order of this multiplication. My question is, are any of these optimisation procedures utilised in BLAS? If not, can I get better speed if I explicitly define the order of matrix multiplications in programs like Matlab with appropriate use of brackets?

paul-g paul-g · Accepted Answer · 2016-05-12T20:19:47

A canonical definition of BLAS can be found here and doesn't include a call with multiple matrices. So I don't expect any implementation which follows that definition provides the chaining optimisation you mention. It is hard to give a definitive answer, because BLAS has been done to death in the past 30 years, so there are many implementations out there and who knows, maybe some bored PhD student decided to do it at some point :)

That being said, there are implementations that are similar but not compatible with BLAS like Eigen which use features like Expression templates (...) to intelligently remove temporaries and enable lazy evaluation, when that is appropriate. This is something promising, but whether it applies to matrix chaining I don't really know. I suspect not, judging by the fact that its not included in their benchmark.

The bottom line is that the most reliable way to find out, is to just try it out yourself. You can check quite easily for your language/implementation of choice: just try out the example you have in your question but preferably with larger sizes, e.g. all dimensions times 100.

How does BLAS incorporate matrix chain multiplication optimisation

1 Answers