1
votes

Here is a minimal example:

from numba import jit
import numba as nb
import numpy as np

# eager compilation: the signature is declared up front
@jit(nb.float64[:, :](nb.int32[:, :]))
def go_fast(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])  # accumulate tanh of the diagonal
    return a + trace               # broadcast the scalar over the array

# lazy compilation: types are inferred at the first call
@jit
def go_fast2(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

Running in Jupyter:

x = np.arange(10000).astype(np.int32).reshape(100, 100)  # int32, to match the eager signature
%timeit go_fast(x)
%timeit go_fast2(x)

leads to

5.65 µs ± 27.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

3.8 µs ± 46.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Why does eager compilation lead to slower execution?


1 Answer

2
votes

Knowing that the memory accesses are contiguous simplifies the life of an optimizer (there is an example of this for Cython, and the same holds for numba, even if clang is often more clever than gcc).
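
As a quick illustration of what "contiguous" means here, a plain-numpy sketch (the array mirrors the one from the question):

import numpy as np

x = np.arange(10000).reshape(100, 100)
print(x.flags['C_CONTIGUOUS'])          # True: the rows are laid out back-to-back in memory
print(x[:, ::2].flags['C_CONTIGUOUS'])  # False: a strided view is not contiguous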

Your example seems to be such a case:

  1. Without eager compilation, numba detects that the data is C-contiguous and exploits this, e.g. for vectorization (see the inspection sketch after this list).
  2. With eager compilation, you don't provide this information, so the optimizer must take into account that the memory accesses could be non-contiguous, and it generates jit-code which is less performant than the first version.
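
A minimal sketch of how one can verify the first point, reusing go_fast2 from the question: the memory layout is part of the numba type, and the lazy dispatcher records the exact type it specialized on.

import numba as nb
import numpy as np

x = np.arange(10000).astype(np.int32).reshape(100, 100)
print(nb.typeof(x))         # e.g. array(int32, 2d, C) - the trailing "C" is the layout

go_fast2(x)                 # trigger the lazy compilation
print(go_fast2.signatures)  # e.g. [(array(int32, 2d, C),)] - contiguity was inferred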

Thus, you should provide a more precise signature:

# eager compilation, now with contiguity declared: ::1 marks the last axis as contiguous
@jit(nb.float64[:, ::1](nb.int32[:, ::1]))
def go_fast3(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

[:, ::1] tells numba that the data will be C-contiguous, and once this information is utilized:

x = np.arange(10000).astype(np.int32).reshape(100, 100)
%timeit go_fast(x)     # 15.6 µs ± 241 ns per loop
%timeit go_fast2(x)    # 8.2 µs ± 90.7 ns per loop
%timeit go_fast3(x)    # 8.2 µs ± 49.6 ns per loop

there is no difference for the eagerly compiled version.
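
The flip side, as a sketch: the eager signature with ::1 is closed, so go_fast3 rejects arrays that are not C-contiguous, while the lazy go_fast2 simply compiles one more specialization for the new layout.

y = np.asfortranarray(x)   # same values, but F-contiguous

go_fast2(y)                # fine: lazy dispatch compiles a second specialization
try:
    go_fast3(y)
except TypeError as e:     # the eager signature only accepts C-contiguous int32 arrays
    print(e)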