Numba Python - how to exploit parallelism effectively?

Question

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.

My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.

On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.

I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.

Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.

How can I get better results?

import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda

lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True


@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output


@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output


@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output



# ======================================= TESTS =======================================

testarray = np.random.rand(lengthY, lengthX)

# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

Your cpu have two 256Bit FMA units, which results in a peak performance of (2*256)/32(bitsize of float32)*2 (1x multiply, 1x add)=32 Flops per cycle per core. It isn't simple to come even to 20% of the peak performance of about 1200 GFLOPS in a real world example. Maybe it would be better to start with C and as an example a well known algorithm like matrix, matrix multiplication and try to come as close as possible to mkl gemm which is very well optimized (np.dot if you use Intel MKL). Example: gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0 — max9111

user3666197 user3666197 · Accepted Answer · 2020-01-10T01:26:50

Q : How can I get better results?

1^st : Learn how to avoid doing useless work - you can straight eliminate HALF of the FLOP-s _{not speaking about also the half of all the RAM-I/O-s avoided, each one being at a cost of +100~350 [ns] per writeback}

Due to the distributive nature of MUL and ADD ( a.C + b.C ) == ( a + b ).C, better first np.sum( A ) and only after that then MUL the sum by the (float) constant.

#utput = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum(            testarray)*multconst #######################

2^nd : Learn how to best align data along the order of processing ( cache-line reuses get you ~100x faster re-use of pre-fetched data. Not aligning vectorised-code along these already pre-fetched data side-effects just let your code pay many times the RAM-access latencies, instead of smart re-using the already paid for data-blocks. Designing work-units aligned according to this principle means a few SLOCs more, but the rewards are worth that - who gets ~100x faster CPUs+RAMs for free and right now or about a ~100x speedup for free, just from not writing a badly or naively designed looping iterators?

3^rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code-blocks and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay an extra time for this auto-analyses per call, while you have designed the code and knew exactly what data-types are going to go there, so why to pay an extra time for auto-analysis each time a numba-block gets called???)

4^th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, not to ever pay way more than you will get back (to at least justify the add-on costs... ) - paying extra costs for not getting any reward is possible, yet has no beneficial impact on your code's performance ( rather the opposite )

5^th : Learn when and how the manually created inline(s) may save your code, once the steps 1-4 are well learnt and routinely excersised with proper craftmanship ( Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted single purpose smart designed assembly code was able to get the same results in about 12 minutes(!), not several days without any GPU/CPU tricks etc - yes, that faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data )

Did I mention float32 may surprise at being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for more efficient float32 pre-fetches? This never happens here, as float64 array gets processed here. _{Sure, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this:
np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )

>>> np.random.rand( 10, 2 ).dtype
dtype('float64')

Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory-zones and wisely controlled gc-policing are another signs of a professional, focused on low-latency & design-for-performance}

Numba Python - how to exploit parallelism effectively?

1 Answers