3
votes

I've set up two identical tests in MATLAB & Python regarding matrix multiplication with broadcasting. For Python I used NumPy, for MATLAB I used the mtimesx library which uses BLAS.

MATLAB

close all; clear;

N = 1000 + 100; % a few initial runs to be trimmed off at the end

a = 100;
b = 30;
c = 40;
d = 50;
A = rand(b, c, a);
B = rand(c, d, a);
C = zeros(b, d, a);

times = zeros(1, N);
for ii = 1:N
    tic
    C = mtimesx(A,B);
    times(ii) = toc;
end

times = times(101:end) * 1e3;

plot(times);
grid on;
title(median(times));

Python

import timeit
import numpy as np
import matplotlib.pyplot as plt


N = 1000 + 100  # a few initial runs to be trimmed off at the end

a = 100
b = 30
c = 40
d = 50
A = np.arange(a * b * c).reshape([a, b, c])
B = np.arange(a * c * d).reshape([a, c, d])
C = np.empty(a * b * d).reshape([a, b, d])

times = np.empty(N)

for i in range(N):
    start = timeit.default_timer()
    C = A @ B
    times[i] = timeit.default_timer() - start

times = times[101:] * 1e3

plt.plot(times, linewidth=0.5)
plt.grid()
plt.title(np.median(times))
plt.show()
  • My Python is the default one installed from pip which uses OpenBLAS.
  • I'm running on Intel NUC i3.

The MATLAB code is running in 1ms while the Python in 5.8ms, and I can't figure out why, as it seems both of them are using BLAS.


EDIT

From Anaconda:

In [7]: np.__config__.show()
mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]

From numpy using pip

In [2]: np.__config__.show()
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]

EDIT 2 I tried to replace C = A @ B with np.matmul(A, B, out=C) and got 2x worse time, e.g. around 11ms. This is really strange.

3
@etmuse Thanks, already saw that. My argument is that both matlab (or mtimesx) and numpy are using BLAS, so I don't see why there should be any difference.galah92
@galah92 Which BLAS, though. If you see the most voted answer in that post, it mentions that Matlab uses Intel MKL, which is very fast (on Intel hardware, at least). You can check what you NumPy distribution is using with np.show_config(); in my case it is OpenBLAS. The different between these two is significant.jdehesa
Btw, this is why there is intel-numpy, part of the Intel Distribution for Python.jdehesa
To reiterate the above: Please show the output of np.show_config()Eric

3 Answers

7
votes

Your MATLAB code uses floating point arrays, but the NumPy code uses integer arrays. This makes a significant difference in the timing. For an "apples to apples" comparison between MATLAB and NumPy, the Python/NumPy code must also use floating point arrays.

That is not the only significant issue, however. There is a deficiency in NumPy discussed in issue 7569 (and again in issue 8957) in the NumPy github site. Matrix multiplication of "stacked" arrays does not use fast BLAS routines to perform the multiplications. This means that the multiplication of arrays with more than two dimensions can be much slower than expected.

Multiplication of 2-d arrays does use the fast routines, so you can work around this issue by multiplying the individual 2-d arrays in a loop. Surprisingly, despite the overhead of a Python loop, it is faster than @, matmul or einsum applied to the full stacked arrays in many cases.

Here's a variation of a function shown in the NumPy issue that does the matrix multiplications in a Python loop:

def xmul(A, B):
    """
    Multiply stacked matrices A (with shape (s, m, n)) by stacked
    matrices B (with shape (s, n, p)) to produce an array with
    shape (s, m, p).

    Mathematically equivalent to A @ B, but faster in many cases.

    The arguments are not validated.  The code assumes that A and B
    are numpy arrays with the same data type and with shapes described
    above.
    """
    out = np.empty((a.shape[0], a.shape[1], b.shape[2]), dtype=a.dtype)
    for j in range(a.shape[0]):
        np.matmul(a[j], b[j], out=out[j])
    return out

My NumPy installation also uses MKL (it is part of the Anaconda distribution). Here is a timing comparison of A @ B and xmul(A, B), using arrays of floating point values:

In [204]: A = np.random.rand(100, 30, 40)

In [205]: B = np.random.rand(100, 40, 50)

In [206]: %timeit A @ B
4.76 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [207]: %timeit xmul(A, B)
582 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even though xmul uses a Python loop, it takes about 1/8 the time of A @ B.

1
votes

I think this is a problem of memory ordering. Matlab's zeros(a, b, c) is like numpy's zeros((a, b, c), order='F'), which is not the default.

Of course, as you've already identified, @ operates on different axes to mtimesx. To make the comparison fair, you should ensure your arrays are in the matlab order, then transpose to deal with the semantics difference

# note: `order` in reshape actually changes the resulting array data,
# not just its memory layout
A = np.arange(a * b * c).reshape([b, c, a], order='F').transpose((2, 0, 1))
B = np.arange(a * c * d).reshape([c, d, a], order='F').transpose((2, 0, 1))
0
votes

Could you try again with the recently released NumPy 1.16? We refactored matmul to use BLAS for the inner two dimensions, which should speed up the code.