First of all, the basic operation you have shown is to take two arrays, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host.
It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.
Secondly, it's important, I believe, to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way that you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here also.
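As a minimal illustration of that principle (my own toy example, not your code, and using the default CPU target just to keep it simple): the ufunc definition describes what one worker does with one scalar, and passing large arrays is what causes numba to apply that definition across the whole data set:
import numpy as np
from numba import vectorize

# the "worker" definition: what one worker does with one pair of scalars
@vectorize(['float64(float64, float64)'])
def scale(a, b):
    return a * b

x = np.arange(1000000, dtype=np.float64)
y = np.full_like(x, 2.0)
# passing million-element arrays is what tells numba to apply the scalar
# definition across the whole data set (one "worker" per element)
z = scale(x, y)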
So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect in fact that 64 is actually just a numba minimum threadblock size, and in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all for-loops, in a serial fashion.
That's obviously not what anyone would intend for rational use of vectorize or guvectorize.
So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a runtime of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel duration in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (which takes about 20 milliseconds total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really, only 1 thread) in action with your original code.
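As a quick arithmetic cross-check of those kernel dimensions (nothing here beyond the numbers already shown above):
blocks, threads_per_block = 15625, 128      # grid/block size from the profiler
total_threads = blocks * threads_per_block  # = 2,000,000
elements = 100 * 4000 * 5                   # = 2,000,000 multiplies to perform
assert total_threads == elements            # one thread/worker per output element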
Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matching the other two arrays, and be done with it.
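For example, a minimal sketch of that replication step (shapes taken from the examples above; np.broadcast_to returns a read-only view, so we materialize a real copy before handing it to the ufunc):
import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))

# replicate the (4000,5) multBy across the leading 100 dimension so that all
# three arrays are (100,4000,5); np.tile would work here as well
multBy_full = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))
# cv = cVestDiscount(multBy_full, discount)   # the vectorize ufunc from t15.py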
But the question remains: how to handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
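For reference, a minimal sketch (my own, not code from the question) of what such a @cuda.jit kernel might look like, using one thread per output element; the block shape here is an arbitrary choice and is not tuned:
import numpy as np
from numba import cuda

@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    # one thread per (row, col, n) output element
    i, j, k = cuda.grid(3)
    if i < cv.shape[0] and j < cv.shape[1] and k < cv.shape[2]:
        cv[i, j, k] = multBy[j, k] * discount[i, j, k]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
cv = np.zeros_like(discount)

threads = (8, 8, 4)   # arbitrary 3D block shape (256 threads per block)
blocks = tuple((cv.shape[d] + threads[d] - 1)//threads[d] for d in range(3))
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)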
In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.
In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give the possibility for multiple workers to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However, for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see below for a worked example of how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
307.05ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
307.79ms 5.9293ms - - - - - 15.259MB 2.5131GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.34ms 1.3440us - - - - - 8B 5.6766MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.54ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
317.27ms 47.398ms (2 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms 7.3799ms - - - - - 15.259MB 2.0192GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Observations:
The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified the 3-dimensional items in the signature to two-dimensional. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on a slice.
The kernel dimensions as reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization, there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization, there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.
The kernel performance reported by the profiler of 47.4ms is now somewhere in between the original (~1.225s) and the massively parallel vectorize version (at ~0.001s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are at here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~6ms, judging from the transfer rates above) of sending a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.
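As a rough back-of-the-envelope check of that extra transfer cost (using only the array sizes and the roughly 2.5GB/s host-to-device rate visible in the profiler traces above):
full_multBy_bytes = 100 * 4000 * 5 * 8   # ~15.26MB if multBy matches discount/cv
orig_multBy_bytes = 4000 * 5 * 8         # ~156KB as in the original code
extra_seconds = (full_multBy_bytes - orig_multBy_bytes) / 2.5e9
print(extra_seconds)                     # roughly 0.006s, i.e. ~6ms of extra copy time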
Some additional comments on the python timing: I believe the exact behavior of how the necessary kernels get compiled for the original, vectorize, and guvectorize improved versions is different, so the first call carries different compilation overhead in each case. If we modify the t15.py code to do a "warm-up" run first, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"
We can be guided by what worked to slice along the first dimension. One possible approach would be to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove that dimension from the kernel's work (letting the ufunc system slice along it instead), similar to what we did in the previous treatment of guvectorize. Here's a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, cv):
    for ID in range(0,countRow):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we sliced into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".
Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.
As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer):
"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."
We now have convenient test cases in either t16.py or t17.py that we could work with to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:
$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real 0m0.528s
user 0m0.474s
sys 0m0.047s
$
So we see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall walltime is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.
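As a final point of reference (a sketch only, not timed above): the same element-wise operation can also be expressed as a single NumPy broadcasted multiply on the host, with no numba involvement at all:
import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))

# NumPy broadcasts the (4000,5) multBy across the leading 100 dimension
cv = multBy * discount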