
So while evaluating possibilities to speed up Python code, I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication.

I was quite impressed with numba's performance and implemented some of our functions in numba. Unfortunately, the speedup was only there for very small matrices; for large matrices the code became very slow compared to the previous scipy.sparse implementation. I thought this made sense, but nevertheless I repeated the test from the original post (code below).

When using a 1000 x 1000 matrix, according to that post even the pure-Python implementation should take roughly 0.01 s. Here are my results, though:

python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds

What am I doing wrong to get such different results from those in the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64-bit) and Python 2.7.10 (32-bit), and a colleague got the same results with the same code. These are the results for a 100 x 100 matrix:

python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds

Did I make some obvious mistake?

import numpy as np
import numba as nb
import scipy.sparse
import time


class benchmark(object):
    """Context manager that prints the wall-clock time of its block."""
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end - self.start))
        return False  # do not suppress exceptions


def dot_py(A, B):
    # Naive triple-loop matrix multiplication in pure Python.
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m, p))

    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C


def dot_np(A, B):
    # Dense multiplication; NumPy dispatches to BLAS when available.
    C = np.dot(A, B)
    return C


def dot_scipy(A, B):
    # Sparse * sparse; scipy.sparse overloads * as matrix multiplication.
    C = A * B
    return C

# The explicit signature makes numba compile dot_py eagerly, at this line.
dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

dim_x = 1000
dim_y = 1000
a = scipy.sparse.rand(dim_x, dim_y, density=0.01)  # 1% of entries nonzero
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()  # dense copies for the python/numpy/numba variants
b_full = b.toarray()

print("starting test")

with benchmark("python"):
    dot_py(a_full, b_full)

with benchmark("numpy"):
    dot_np(a_full, b_full)

with benchmark("numba"):
    dot_nb(a_full, b_full)

with benchmark("scipy"):
    dot_scipy(a, b)

print("finishing test")

Edit:

For anyone seeing this at a later time, these are the results I got when using sparse n x n matrices (1% of elements are nonzero):

[plot: runtime of n x n matrix multiplication]
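A rough intuition for the scipy numbers, as an aside: with 1% density, sparse-times-sparse multiplication only does work proportional to the nonzero structure, roughly n * (n * density)^2 multiply-adds here rather than n^3, and the result stays sparse. A quick way to see that, assuming the csr format:

import scipy.sparse

n = 1000
a = scipy.sparse.rand(n, n, density=0.01, format='csr')
b = scipy.sparse.rand(n, n, density=0.01, format='csr')

c = a * b            # sparse * sparse stays sparse
print(type(c))       # <class 'scipy.sparse...csr_matrix'>
print(c.nnz, n * n)  # stored entries vs. the full n*n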

Probably running vendor BLAS; you can check by running np.__config__.show() and seeing whether anything like MKL, OpenBLAS, or ATLAS is listed. Edit: You are also using sparse matrices, which will change things. SciPy actually has sparse BLAS, I think, while everything else will convert those matrices to dense. - Daniel
It seems I do not have any special BLAS installed. Wouldn't that impact scipy, numpy, and maybe numba only, though? Or would the three nested loops in the Python function be optimized as well? - JanM
It would only affect scipy and numpy. - Daniel
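For reference, checking the BLAS backend as Daniel suggests is a one-liner (np.__config__.show() is the standard NumPy call; its output format varies across versions):

import numpy as np

# Prints the build configuration. Look for MKL, OpenBLAS, ATLAS,
# or Accelerate; if only the reference blas/lapack shows up,
# np.dot is not using an optimized BLAS.
np.__config__.show()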

1 Answer


In the linked Stack Overflow question where you got the code from, m = n = 3 and only p is variable, whereas you are using m = n = 1000. Since the pure-Python triple loop performs m * n * p inner iterations, that difference alone changes the amount of work by several orders of magnitude, which is going to make a huge difference in the timings.
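To put rough numbers on that (my back-of-the-envelope arithmetic, taking p = 1000 for illustration):

# Inner-loop iterations of dot_py in the two setups:
ops_original = 3 * 3 * 1000        # m = n = 3, p = 1000  ->  9,000
ops_question = 1000 * 1000 * 1000  # m = n = p = 1000     ->  1,000,000,000

print(ops_question // ops_original)  # ~111,000x more work

# At ~10^6 interpreted iterations per second, a billion iterations
# lands in the hundreds-of-seconds range, consistent with the
# ~770 s measured in the question.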