I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
output = (float)(0.0)
multconst = (float)(.99)
output = np.sum(np.multiply(testarray, multconst))
return output
@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(lengthX):
output += multconst*testarray[y,x]
return output
@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(int(lengthX/4)):
xn = x*4
output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_numpy(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
start = timer()
output = MAC_01(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_04(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))