I found this comparison interesting and wanted to look into the impact of reusing compiled kernels, CUDA streams, and randomized input data (to make sure no fancy compiler optimization was skewing what we saw).
I modified the code sample posted by Robert Crovella and ran the script on a modest ML rig at school:
Code:
import numpy as np
from time import perf_counter
from numba import njit, prange, cuda
# cpuinfo is a third party package from here:
# https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
# python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info
print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# Prints a json string describing the cpu
s = get_cpu_info()
print("Cpu info")
for k, v in s.items():
    print(f"\t{k}: {v}")
print()
cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1
# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)
def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True, nogil=True)
    def smooth_cpu(x, out):
        h, w = x.shape
        for i in prange(1, h - 1):
            for j in range(1, w - 1):
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] +
                             x[i - 1, j + 1] + x[i, j - 1] +
                             x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] +
                             x[i + 1, j + 1]) / 9

    pre_x = np.ones((10, 10), dtype=common_x.dtype)
    pre_out = np.ones((10, 10), dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)

    # run on cpu
    smooth_cpu(pre_x, pre_out)  # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out, _x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()

    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x
def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogous tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.
    For now, a rough sketch of what I mean is included further down in this post.
    """
    start_time_full_function = perf_counter()

    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.
        This way we offset some of the overhead cost spent on block allocation by
        making each block do a bit more work.
        Note: For this to work right, we have to allocate fewer blocks with
        our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows, cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart + 1, rows - 1, istep):
            for j in range(jstart + 1, cols - 1, jstep):
                # x and out use numpy's default row-major (C-order) layout, so we
                # make sure the most frequently changing index (j) iterates over
                # the last dimension of the array.
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i, j - 1] + x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x, stream)
    out_gpu = cuda.to_device(_out, stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj, tpbi
    blockspergrid_j = (_x.shape[0] + tpbj - 1) // tpbj
    blockspergrid_i = (_x.shape[1] + tpbi - 1) // tpbi
    # reduce the number of blocks in each axis
    # by a quarter to give room for striding
    blockspergrid = (blockspergrid_j // 4, blockspergrid_i // 4)

    # run on gpu
    # note: this only configures the launch; numba compiles the kernel lazily on its
    # first call, so the first timed launch still pays that compilation cost
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream]
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu, x_gpu)
    x_gpu.copy_to_host(_out, stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()

    print(f"{gpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _out
if __name__ == '__main__':
    print("Time comparisons for CPU vs GPU implementations:")
    for n_loops in (1, 5, 10, 20, 30, 40, 50):
        a = time_njit(n_loops)
        b = time_cuda(n_loops)
        assert np.allclose(a, b), "The two functions didn't actually compute the same results"
        print(f"{' '*4}Outputs are equivalent")
Output:
Some diagnostic info for the system running this script:
Found 1 CUDA devices
id 0 b'GeForce RTX 2080 Ti' [SUPPORTED]
compute capability: 7.5
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
Cpu info:
python_version: 3.8.8.final.0 (64 bit)
cpuinfo_version: [7, 0, 0]
cpuinfo_version_string: 7.0.0
arch: X86_64
bits: 64
count: 8
arch_string_raw: AMD64
vendor_id_raw: GenuineIntel
brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
hz_advertised_friendly: 4.0000 GHz
hz_actual_friendly: 4.0010 GHz
hz_advertised: [4000000000, 0]
hz_actual: [4001000000, 0]
l2_cache_size: 1048576
stepping: 3
model: 60
family: 6
l3_cache_size: 8388608
flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
l2_cache_line_size: 256
l2_cache_associativity: 6
Time comparisons for CPU vs GPU implementations:
CPU execution time: 0.327143s running 1 loops
CPU full setup/execution time: 0.980959s
GPU kernel execution time: 0.088015s running 1 loops
GPU full kernel setup/execution time: 0.868085s
Outputs are equivalent
CPU execution time: 1.539007s running 5 loops
CPU full setup/execution time: 2.134781s
GPU kernel execution time: 0.097627s running 5 loops
GPU full kernel setup/execution time: 0.695104s
Outputs are equivalent
CPU execution time: 3.463488s running 10 loops
CPU full setup/execution time: 4.310506s
GPU kernel execution time: 0.122363s running 10 loops
GPU full kernel setup/execution time: 0.655500s
Outputs are equivalent
CPU execution time: 6.416840s running 20 loops
CPU full setup/execution time: 7.011254s
GPU kernel execution time: 0.158903s running 20 loops
GPU full kernel setup/execution time: 0.723226s
Outputs are equivalent
CPU execution time: 9.285086s running 30 loops
CPU full setup/execution time: 9.890282s
GPU kernel execution time: 0.209807s running 30 loops
GPU full kernel setup/execution time: 0.728618s
Outputs are equivalent
CPU execution time: 12.610949s running 40 loops
CPU full setup/execution time: 13.177427s
GPU kernel execution time: 0.253696s running 40 loops
GPU full kernel setup/execution time: 0.836536s
Outputs are equivalent
CPU execution time: 15.376767s running 50 loops
CPU full setup/execution time: 15.976361s
GPU kernel execution time: 0.289626s running 50 loops
GPU full kernel setup/execution time: 0.841918s
Outputs are equivalent
Process finished with exit code 0
If I'm being honest, these results both agree and disagree with my expectations. I had expected the CPU implementation to outperform the GPU at least on the single-loop calls, but it doesn't. :v/ The seemingly linear increase in CPU time cost as the number of loops grows was expected, though.
As for the GPU performance, I really don't know why the time cost seems to grow logarithmically with the loop count; I would have to plot the data points to see it more clearly (a quick plotting sketch follows).
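For anyone who wants to eyeball the growth curves, here is a minimal plotting sketch using the kernel-only timings copied from the output above. It assumes matplotlib is installed, and of course the numbers are from my run and will differ on other machines:

import matplotlib.pyplot as plt

# kernel-only timings copied from the output above
loops = [1, 5, 10, 20, 30, 40, 50]
cpu_times = [0.327143, 1.539007, 3.463488, 6.416840, 9.285086, 12.610949, 15.376767]
gpu_times = [0.088015, 0.097627, 0.122363, 0.158903, 0.209807, 0.253696, 0.289626]

plt.plot(loops, cpu_times, "o-", label="CPU execution time")
plt.plot(loops, gpu_times, "s-", label="GPU kernel execution time")
plt.xlabel("number of loops")
plt.ylabel("seconds")
plt.legend()
plt.show()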
Regardless, the results you see will vary from machine to machine, but I would be curious at what CUDA compute capability the GPU results fall back to merely match those of a CPU.
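And here is the rough sketch of the cuda.shared.array idea mentioned in the time_cuda docstring. It is untested and unbenchmarked, and kept deliberately simple (one output element per thread, no striding): each 16x16 block stages an 18x18 tile of x -- the block plus a one-cell halo -- in shared memory, so the nine neighbour reads hit shared memory instead of global memory. The names smooth_gpu_shared, TPB, and TILE are just mine for the sketch:

import numpy as np
from numba import cuda, float32

TPB = 16          # threads per block along each axis (must match the launch config)
TILE = TPB + 2    # tile size including a one-cell halo on every side

@cuda.jit
def smooth_gpu_shared(x, out):
    # shared-memory staging area for this block's neighbourhood of x
    tile = cuda.shared.array((TILE, TILE), dtype=float32)

    j, i = cuda.grid(2)            # j -> column, i -> row (same convention as above)
    tj = cuda.threadIdx.x + 1      # this thread's slot inside the tile,
    ti = cuda.threadIdx.y + 1      # offset by 1 to leave room for the halo
    rows, cols = x.shape

    if i < rows and j < cols:
        tile[ti, tj] = x[i, j]
        # threads on the edges of the block also load the halo cells
        if cuda.threadIdx.y == 0 and i > 0:
            tile[ti - 1, tj] = x[i - 1, j]
        if cuda.threadIdx.y == TPB - 1 and i < rows - 1:
            tile[ti + 1, tj] = x[i + 1, j]
        if cuda.threadIdx.x == 0 and j > 0:
            tile[ti, tj - 1] = x[i, j - 1]
        if cuda.threadIdx.x == TPB - 1 and j < cols - 1:
            tile[ti, tj + 1] = x[i, j + 1]
        # the four corner threads grab the diagonal halo cells
        if cuda.threadIdx.x == 0 and cuda.threadIdx.y == 0 and i > 0 and j > 0:
            tile[ti - 1, tj - 1] = x[i - 1, j - 1]
        if cuda.threadIdx.x == TPB - 1 and cuda.threadIdx.y == 0 and i > 0 and j < cols - 1:
            tile[ti - 1, tj + 1] = x[i - 1, j + 1]
        if cuda.threadIdx.x == 0 and cuda.threadIdx.y == TPB - 1 and i < rows - 1 and j > 0:
            tile[ti + 1, tj - 1] = x[i + 1, j - 1]
        if cuda.threadIdx.x == TPB - 1 and cuda.threadIdx.y == TPB - 1 and i < rows - 1 and j < cols - 1:
            tile[ti + 1, tj + 1] = x[i + 1, j + 1]

    # every thread in the block must reach this barrier before anyone reads the tile
    cuda.syncthreads()

    # interior points only, matching the other kernels
    if 0 < i < rows - 1 and 0 < j < cols - 1:
        out[i, j] = (tile[ti - 1, tj - 1] + tile[ti - 1, tj] + tile[ti - 1, tj + 1] +
                     tile[ti, tj - 1] + tile[ti, tj] + tile[ti, tj + 1] +
                     tile[ti + 1, tj - 1] + tile[ti + 1, tj] + tile[ti + 1, tj + 1]) / 9

# example launch: the grid must cover the whole array since this kernel isn't strided
x_gpu = cuda.to_device(np.random.random((10000, 10000)).astype(np.float32))
out_gpu = cuda.device_array_like(x_gpu)
bpg = ((10000 + TPB - 1) // TPB, (10000 + TPB - 1) // TPB)
smooth_gpu_shared[bpg, (TPB, TPB)](x_gpu, out_gpu)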
timeit may actually be smearing some different behaviors together. For careful timing of a kernel-only execution in cupy or numba, I would suggest the method I indicate below: use device-resident arrays, and be careful to use cuda.synchronize() to guard the timing region. – Robert Crovella
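For what it's worth, a minimal sketch of that suggestion (device-resident arrays, a warm-up launch, and cuda.synchronize() bracketing the timed region) might look like the following. It assumes the smooth_gpu kernel and the 16x16 / quarter-grid launch configuration from the script above are available in scope:

from time import perf_counter
import numpy as np
from numba import cuda

x_gpu = cuda.to_device(np.random.random((10000, 10000)).astype(np.float32))
out_gpu = cuda.device_array_like(x_gpu)  # stays on the device; no host round-trips

tpb = (16, 16)
bpg = ((x_gpu.shape[0] + 15) // 16 // 4, (x_gpu.shape[1] + 15) // 16 // 4)
kernel = smooth_gpu[bpg, tpb]  # assumes smooth_gpu from the script above

kernel(x_gpu, out_gpu)   # warm-up launch so lazy compilation isn't measured
cuda.synchronize()       # make sure all prior GPU work has finished

start = perf_counter()
kernel(x_gpu, out_gpu)
cuda.synchronize()       # launches are asynchronous; wait before stopping the clock
print(f"kernel-only time: {perf_counter() - start:1.6f}s")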