I found this comparison interesting and wanted to look into the impact of reusing compiled kernels, CUDA streams, and randomized input data (to make sure no fancy compiler optimization was skewing what we saw).
I modified the code sample posted by Robert Crovella and ran the script on a modest ML rig at school:
Code:
import numpy as np
from time import perf_counter
from numba import njit, prange, cuda
# cpuinfo is a third party package from here:
# https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
# python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info
print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# Prints a json string describing the cpu
s = get_cpu_info()
print("Cpu info")
for k, v in s.items():
    print(f"\t{k}: {v}")
print()
cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1
# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)
def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True, nogil=True)
    def smooth_cpu(x, out):
        h, w = x.shape
        for i in prange(1, h - 1):
            for j in range(1, w - 1):
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] +
                             x[i - 1, j + 1] + x[i, j - 1] +
                             x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] +
                             x[i + 1, j + 1]) / 9

    pre_x = np.ones((10, 10), dtype=common_x.dtype)
    pre_out = np.ones((10, 10), dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)

    # run on cpu
    smooth_cpu(pre_x, pre_out)  # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out, _x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()

    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x
def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogous tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.
    For now, a rough sketch of what I mean is included further down in this post.
    """
    start_time_full_function = perf_counter()

    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.
        This way we offset some of the overhead cost spent on block allocation by
        making each block do a bit more work.
        Note: For this to work right, we have to allocate fewer blocks with
        our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows, cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart + 1, rows - 1, istep):
            for j in range(jstart + 1, cols - 1, jstep):
                # x and out use numpy's default row-major (C-order) layout, so we
                # make sure the most frequently changing index (j) iterates over
                # the last dimension of the array.
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i, j - 1] + x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x, stream)
    out_gpu = cuda.to_device(_out, stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj, tpbi
    blockspergrid_j = (_x.shape[0] + tpbj - 1) // tpbj
    blockspergrid_i = (_x.shape[1] + tpbi - 1) // tpbi
    # reduce the number of blocks in each axis
    # by a quarter to give room for striding
    blockspergrid = (blockspergrid_j // 4, blockspergrid_i // 4)

    # run on gpu
    # note: this only configures the launch; numba compiles the kernel lazily on its
    # first call, so the first timed launch still pays that compilation cost
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream]
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu, x_gpu)
    x_gpu.copy_to_host(_out, stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()

    print(f"{gpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _out
if __name__ == '__main__':
    print("Time comparisons for CPU vs GPU implementations:")
    for n_loops in (1, 5, 10, 20, 30, 40, 50):
        a = time_njit(n_loops)
        b = time_cuda(n_loops)
        assert np.allclose(a, b), "The two functions didn't actually compute the same results"
        print(f"{' '*4}Outputs are equivalent")
Output:
Some diagnostic info for the system running this script:
Found 1 CUDA devices
id 0 b'GeForce RTX 2080 Ti' [SUPPORTED]
compute capability: 7.5
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
Cpu info:
python_version: 3.8.8.final.0 (64 bit)
cpuinfo_version: [7, 0, 0]
cpuinfo_version_string: 7.0.0
arch: X86_64
bits: 64
count: 8
arch_string_raw: AMD64
vendor_id_raw: GenuineIntel
brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
hz_advertised_friendly: 4.0000 GHz
hz_actual_friendly: 4.0010 GHz
hz_advertised: [4000000000, 0]
hz_actual: [4001000000, 0]
l2_cache_size: 1048576
stepping: 3
model: 60
family: 6
l3_cache_size: 8388608
flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
l2_cache_line_size: 256
l2_cache_associativity: 6
Time comparisons for CPU vs GPU implementations:
CPU execution time: 0.327143s running 1 loops
CPU full setup/execution time: 0.980959s
GPU kernel execution time: 0.088015s running 1 loops
GPU full kernel setup/execution time: 0.868085s
Outputs are equivalent
CPU execution time: 1.539007s running 5 loops
CPU full setup/execution time: 2.134781s
GPU kernel execution time: 0.097627s running 5 loops
GPU full kernel setup/execution time: 0.695104s
Outputs are equivalent
CPU execution time: 3.463488s running 10 loops
CPU full setup/execution time: 4.310506s
GPU kernel execution time: 0.122363s running 10 loops
GPU full kernel setup/execution time: 0.655500s
Outputs are equivalent
CPU execution time: 6.416840s running 20 loops
CPU full setup/execution time: 7.011254s
GPU kernel execution time: 0.158903s running 20 loops
GPU full kernel setup/execution time: 0.723226s
Outputs are equivalent
CPU execution time: 9.285086s running 30 loops
CPU full setup/execution time: 9.890282s
GPU kernel execution time: 0.209807s running 30 loops
GPU full kernel setup/execution time: 0.728618s
Outputs are equivalent
CPU execution time: 12.610949s running 40 loops
CPU full setup/execution time: 13.177427s
GPU kernel execution time: 0.253696s running 40 loops
GPU full kernel setup/execution time: 0.836536s
Outputs are equivalent
CPU execution time: 15.376767s running 50 loops
CPU full setup/execution time: 15.976361s
GPU kernel execution time: 0.289626s running 50 loops
GPU full kernel setup/execution time: 0.841918s
Outputs are equivalent
Process finished with exit code 0
If I'm being honest, these results both agree and disagree with my expectations. I had expected the CPU implementation to outperform the GPU at least on the single-loop calls, but it doesn't. :v/ The seemingly linear increase in CPU time cost as the number of loops grows was expected, though.
As for the GPU performance, I really don't know why the time cost seems to grow logarithmically with the loop count; I would have to plot the data points to see it more clearly (a quick plotting sketch follows).
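For anyone who wants to eyeball the growth curves, here is a minimal plotting sketch using the kernel-only timings copied from the output above. It assumes matplotlib is installed, and of course the numbers are from my run and will differ on other machines:

import matplotlib.pyplot as plt

# kernel-only timings copied from the output above
loops = [1, 5, 10, 20, 30, 40, 50]
cpu_times = [0.327143, 1.539007, 3.463488, 6.416840, 9.285086, 12.610949, 15.376767]
gpu_times = [0.088015, 0.097627, 0.122363, 0.158903, 0.209807, 0.253696, 0.289626]

plt.plot(loops, cpu_times, "o-", label="CPU execution time")
plt.plot(loops, gpu_times, "s-", label="GPU kernel execution time")
plt.xlabel("number of loops")
plt.ylabel("seconds")
plt.legend()
plt.show()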
Regardless, the results you see will vary from machine to machine, but I would be curious at what CUDA compute capability the GPU results fall back to merely match those of a CPU.
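And here is the rough sketch of the cuda.shared.array idea mentioned in the time_cuda docstring. It is untested and unbenchmarked, and kept deliberately simple (one output element per thread, no striding): each 16x16 block stages an 18x18 tile of x -- the block plus a one-cell halo -- in shared memory, so the nine neighbour reads hit shared memory instead of global memory. The names smooth_gpu_shared, TPB, and TILE are just mine for the sketch:

import numpy as np
from numba import cuda, float32

TPB = 16          # threads per block along each axis (must match the launch config)
TILE = TPB + 2    # tile size including a one-cell halo on every side

@cuda.jit
def smooth_gpu_shared(x, out):
    # shared-memory staging area for this block's neighbourhood of x
    tile = cuda.shared.array((TILE, TILE), dtype=float32)

    j, i = cuda.grid(2)            # j -> column, i -> row (same convention as above)
    tj = cuda.threadIdx.x + 1      # this thread's slot inside the tile,
    ti = cuda.threadIdx.y + 1      # offset by 1 to leave room for the halo
    rows, cols = x.shape

    if i < rows and j < cols:
        tile[ti, tj] = x[i, j]
        # threads on the edges of the block also load the halo cells
        if cuda.threadIdx.y == 0 and i > 0:
            tile[ti - 1, tj] = x[i - 1, j]
        if cuda.threadIdx.y == TPB - 1 and i < rows - 1:
            tile[ti + 1, tj] = x[i + 1, j]
        if cuda.threadIdx.x == 0 and j > 0:
            tile[ti, tj - 1] = x[i, j - 1]
        if cuda.threadIdx.x == TPB - 1 and j < cols - 1:
            tile[ti, tj + 1] = x[i, j + 1]
        # the four corner threads grab the diagonal halo cells
        if cuda.threadIdx.x == 0 and cuda.threadIdx.y == 0 and i > 0 and j > 0:
            tile[ti - 1, tj - 1] = x[i - 1, j - 1]
        if cuda.threadIdx.x == TPB - 1 and cuda.threadIdx.y == 0 and i > 0 and j < cols - 1:
            tile[ti - 1, tj + 1] = x[i - 1, j + 1]
        if cuda.threadIdx.x == 0 and cuda.threadIdx.y == TPB - 1 and i < rows - 1 and j > 0:
            tile[ti + 1, tj - 1] = x[i + 1, j - 1]
        if cuda.threadIdx.x == TPB - 1 and cuda.threadIdx.y == TPB - 1 and i < rows - 1 and j < cols - 1:
            tile[ti + 1, tj + 1] = x[i + 1, j + 1]

    # every thread in the block must reach this barrier before anyone reads the tile
    cuda.syncthreads()

    # interior points only, matching the other kernels
    if 0 < i < rows - 1 and 0 < j < cols - 1:
        out[i, j] = (tile[ti - 1, tj - 1] + tile[ti - 1, tj] + tile[ti - 1, tj + 1] +
                     tile[ti, tj - 1] + tile[ti, tj] + tile[ti, tj + 1] +
                     tile[ti + 1, tj - 1] + tile[ti + 1, tj] + tile[ti + 1, tj + 1]) / 9

# example launch: the grid must cover the whole array since this kernel isn't strided
x_gpu = cuda.to_device(np.random.random((10000, 10000)).astype(np.float32))
out_gpu = cuda.device_array_like(x_gpu)
bpg = ((10000 + TPB - 1) // TPB, (10000 + TPB - 1) // TPB)
smooth_gpu_shared[bpg, (TPB, TPB)](x_gpu, out_gpu)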
timeit may actually be smearing some different behaviors together. For careful timing of a kernel-only execution in cupy or numba, I would suggest the method I indicate below: use device-resident arrays, and be careful to use cuda.synchronize() to guard the timing region. – Robert Crovella
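For what it's worth, a minimal sketch of that suggestion (device-resident arrays, a warm-up launch, and cuda.synchronize() bracketing the timed region) might look like the following. It assumes the smooth_gpu kernel and the 16x16 / quarter-grid launch configuration from the script above are available in scope:

from time import perf_counter
import numpy as np
from numba import cuda

x_gpu = cuda.to_device(np.random.random((10000, 10000)).astype(np.float32))
out_gpu = cuda.device_array_like(x_gpu)  # stays on the device; no host round-trips

tpb = (16, 16)
bpg = ((x_gpu.shape[0] + 15) // 16 // 4, (x_gpu.shape[1] + 15) // 16 // 4)
kernel = smooth_gpu[bpg, tpb]  # assumes smooth_gpu from the script above

kernel(x_gpu, out_gpu)   # warm-up launch so lazy compilation isn't measured
cuda.synchronize()       # make sure all prior GPU work has finished

start = perf_counter()
kernel(x_gpu, out_gpu)
cuda.synchronize()       # launches are asynchronous; wait before stopping the clock
print(f"kernel-only time: {perf_counter() - start:1.6f}s")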