I dug into this a bit, so I'll share what I have. Whether or not it is a complete answer I'm not sure, but it may address some of your questions.
Probably the best numba-based approach for this is to write your own "custom" CUDA kernel using numba CUDA (jit). An example of this is here for reduction, or here for matrix multiply. Doing this correctly would require learning something about CUDA programming; however, that didn't seem to be the direction you wanted to go.
As an alternative, your question focuses on using numba vectorization to generate GPU code. The numba vectorize decorator is used for functions that operate on scalar input and output; the vectorization then applies them elementwise over array input/output.
For functions that don't fit this pattern, e.g. those that operate on a scalar or vector but produce a vector, or those that operate on one or two vectors and produce a vector or scalar output, numba provides the generalized guvectorize.
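To make the distinction concrete, here is a minimal CPU-target sketch (my own toy functions, not the ones from your question) showing the two decorators side by side:

import numpy as np
from numba import vectorize, guvectorize

@vectorize(['float32(float32, float32)'])  # scalar in -> scalar out, broadcast elementwise
def add(a, b):
    return a + b

@guvectorize(['void(float32[:], float32[:])'], '(n)->(n)')  # vector "slice" in -> vector "slice" out
def row_cumsum(row, out):
    acc = 0.
    for i in range(row.shape[0]):
        acc += row[i]
        out[i] = acc

a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(add(a, a))        # elementwise, same shape as a
print(row_cumsum(a))    # one cumulative sum per row of a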
Starting with your simplest example of what you want to do, we could realize this with guvectorize:
What I tried so far is to write the function in a non-vectorized form (i.e. treat each entry of X separately), and pass the return array as an input:
def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]
This intent, to take a scalar (per function call) and return a vector (per function call), falls within the capability of guvectorize (with some limitations/caveats - see notes at bottom).
Here is a worked example, derived from the sample code here:
# cat t2.py
from __future__ import print_function
import sys
import numpy as np
from numba import guvectorize, cuda
if sys.version_info[0] == 2:
    range = xrange
# function type:
# - has void return type
#
# signature: (n)->(n)
# - the function takes an array of n-element and output same.
@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
    tmp1 = 0.
    tmp = inp[0]
    for i in range(out.shape[0]):
        tmp1 += tmp
        out[i] = tmp1
        tmp *= inp[0]
# set up input data
rows = 1280000 # shape[0]
cols = 4 # shape[1]
inp = np.zeros(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,0] = (i%4)+1
# invoke on CUDA with manually managed memory
dev_inp = cuda.to_device(inp) # alloc and copy input data
my_func(dev_inp, dev_inp) # invoke the gufunc
dev_inp.copy_to_host(inp) # retrieve the result
# print out
print('result'.center(80, '-'))
print(inp)
# nvprof --print-gpu-trace python t2.py
==4773== NVPROF is profiling process 4773, command: python t2.py
-------------------------------------result-------------------------------------
[[ 1. 2. 3. 4.]
[ 2. 6. 14. 30.]
[ 3. 12. 39. 120.]
...
[ 2. 6. 14. 30.]
[ 3. 12. 39. 120.]
[ 4. 20. 84. 340.]]
==4773== Profiling application: python t2.py
==4773== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
994.08ms 5.5731ms - - - - - 19.531MB 3.4224GB/s Pageable Device Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00083s 159.20us (20000 1 1) (64 1 1) 22 0B 0B - - - - Tesla P100-PCIE 1 7 cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>) [48]
1.00100s 4.8017ms - - - - - 19.531MB 3.9722GB/s Device Pageable Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
# nvprof --metrics gst_efficiency python t2.py
==4787== NVPROF is profiling process 4787, command: python t2.py
==4787== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)" (done)
-------------------------------------result-------------------------------------
[[ 1. 2. 3. 4.]
[ 2. 6. 14. 30.]
[ 3. 12. 39. 120.]
...
[ 2. 6. 14. 30.]
[ 3. 12. 39. 120.]
[ 4. 20. 84. 340.]]
==4787== Profiling application: python t2.py
==4787== Profiling result:
==4787== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla P100-PCIE-16GB (0)"
Kernel: cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)
1 gst_efficiency Global Memory Store Efficiency 25.00% 25.00% 25.00%
#
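As an aside, if you want to sanity-check the gufunc logic without a GPU, the same body can be compiled for the default cpu target. This is just a convenience sketch mirroring the example above (the name my_func_cpu is mine):

import numpy as np
from numba import guvectorize

@guvectorize(['void(float32[:], float32[:])'], '(n)->(n)')  # default target is 'cpu'
def my_func_cpu(inp, out):
    tmp1 = 0.
    tmp = inp[0]
    for i in range(out.shape[0]):
        tmp1 += tmp
        out[i] = tmp1
        tmp *= inp[0]

inp = np.zeros((8, 4), dtype=np.float32)
for i in range(inp.shape[0]):
    inp[i, 0] = (i % 4) + 1
print(my_func_cpu(inp))   # rows should match the pattern in the GPU result above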
To respond to your specific question in the context of guvectorize:
What is the correct decorator to use for my purpose?
The function type specification looks like this:
['<return-type>(<parameter 0 type>, <parameter 1 type>, ...)']
For guvectorize, the return type should always be void. The types of the parameters should match the types you intend the function to be used with. With guvectorize we think about types as a "slice" of the actual input and output data types we will pass, the "slice" being what an individual function call will operate on. The vectorization then applies individual function calls to each "slice" of the input/output data, to cover the entire size of the input/output data set. For my example, then, I am proposing to pass an input vector (float32[:]) and an output vector (float32[:]).
The function signature shows the dimensions of the input, followed by the dimensions of the output:
(x)...->(x)
Each can be multidimensional (although it should still represent only a "slice" of the input/output, for vectorization), and a scalar can be represented by (). A wrinkle arises here, since we want the output "slice" to be a vector of results of some length, say n. numba guvectorize does not seem to allow the specification of an output dimension (n) that is not part of an already specified input dimension. So while this function only really needs a scalar input, I have chosen to work around this by passing a vector "slice" for input and output. In fact this function, the way I have written it, can use the same data for input and output, so there isn't really any "overhead" to this workaround, for this particular case.
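For reference, one other workaround I'm aware of for the ()->(n) limitation (sketched here on the default cpu target; I haven't verified it on the cuda target, and the names are mine) is to pass an extra dummy argument whose only purpose is to carry the output length n in the signature:

import numpy as np
from numba import guvectorize

# the dummy array exists only so the signature has an (n) that the output can reuse
@guvectorize(['void(float32, float32[:], float32[:])'], '(),(n)->(n)')
def scalar_to_vector(x, dummy, out):
    tmp1 = 0.
    tmp = x
    for i in range(out.shape[0]):
        tmp1 += tmp
        out[i] = tmp1
        tmp *= x

x = np.array([1., 2., 3., 4.], dtype=np.float32)
dummy = np.zeros(4, dtype=np.float32)      # its length fixes n
print(scalar_to_vector(x, dummy))          # shape (4, 4), same values as the worked example above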
Some notes about implementation/performance:
This "parallelizes" across the first array (shape[0]
) dimension. All of the computations are being performed on the GPU. The computations to perform each vector slice of the output are being performed by a single thread on the GPU. But for a large data set (first dimension) this will expose plenty of parallel work (threads) for the GPU. The work in the second dimension, i.e. the loop, is operating in the context of each thread. Attempting to parallelize in the second dimension (creating e.g. a prefix sum) would almost certainly not be possible using vectorize
or guvectorize
but should be possible using numba cuda (jit). We can confirm the parallelization dimension by studying the (first) nvprof --print-gpu-trace
output in the above worked example, and noting that it constitutes 20000 blocks of 64 threads each, for a total of 1280000 threads, matching our first array dimension.
As mentioned already, this implementation is somewhat hacky, as I am passing a vector input even though I only need and use a scalar. This is to work around what seems to be a limitation in numba, that you cannot specify a dimension signature like ()->(n), as far as I can tell. (Note: after further study, I think the right thing to do here is to define an input-only ufunc, passing the input vector/matrix as just a single argument instead of twice, and using just (n) as the dimension signature, rather than (n)->(n). See here.)
This implementation is not optimal from a memory access pattern point of view. This is evident from the (second) nvprof --metrics gst_efficiency output. I think the reason for this is also possibly a limitation in numba's design (currently). When numba is presented an array in this example for distribution/parallelization via guvectorize, it slices the array into rows and passes each row to a function call to process. For this particular example, a much more efficient implementation would be to transpose our array storage, have the array sliced into columns, and let each function call work on a column. The reason for this has to do with GPU behavior and design, which I won't review here. However, the metric shows that we're only achieving 25% efficiency with this row-per-thread access pattern. It would be easy to achieve 100% efficiency with transposed data and a column-per-thread approach, but I don't know how to achieve this with numba guvectorize. The only option I know of to resolve the inefficient access would be to revert to writing a numba CUDA kernel (i.e. CUDA jit); a sketch of that approach follows below. (Another alternative I tried to investigate was to see if we could instruct the ufunc to take a column-vector in its signature, e.g. something like void(float32[[:]],float32[[:]]), hoping that numba would slice things by columns, but had no luck with that.)
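For completeness, here is a rough, unprofiled sketch of what such a numba CUDA kernel might look like with transposed storage, so that adjacent threads touch adjacent elements (coalesced access); the kernel name and launch parameters are my own choices:

import numpy as np
from numba import cuda

# data is stored transposed: shape (N, len(X)), so thread i works down "column" i
@cuda.jit
def my_func_transposed(data):
    col = cuda.grid(1)
    if col < data.shape[1]:
        x = data[0, col]
        tmp1 = 0.
        tmp = x
        for row in range(data.shape[0]):
            tmp1 += tmp
            data[row, col] = tmp1
            tmp *= x

rows = 1280000
cols = 4
inp = np.zeros((cols, rows), dtype=np.float32)   # transposed storage
for i in range(rows):
    inp[0, i] = (i % 4) + 1
dev_inp = cuda.to_device(inp)
threads = 64
blocks = (rows + threads - 1) // threads
my_func_transposed[blocks, threads](dev_inp)
dev_inp.copy_to_host(inp)
print(inp[:, :4].T)   # compare with the first rows of the guvectorize result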
Using the previous example as a model, we can create something similar that solves your "actual function" in Update 2, using guvectorize. I've reduced the array sizes here to 5 (= len(X)) by 4 (= N):
# cat t3.py
from __future__ import print_function
import sys
import numpy as np
import numba
from numba import guvectorize, cuda
import math
if sys.version_info[0] == 2:
    range = xrange
@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N): # Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out
# function type:
# - has void return type
#
# signature: (n)->(n)
# - the function takes an array of n-element and output same.
@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
    for i in range(2,out.shape[0]):
        out[i] = out[1]*out[i-1] - (i-1)*out[i-2]
    c = 1.
    for i in range(2,out.shape[0]):
        c *= i
        out[i] /= math.sqrt(c)
# set up input data
rows = 5 # shape[0]
cols = 4 # shape[1]
inp = np.ones(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,1] = i
# invoke on CUDA with manually managed memory
dev_inp = cuda.to_device(inp) # alloc and copy input data
my_func(dev_inp, dev_inp) # invoke the gufunc
dev_inp.copy_to_host(inp) # retrieve the result
# print out
print('gpu result'.center(80, '-'))
print(inp)
rrows = rows
rcols = cols
rin = np.zeros(rrows, dtype=np.float32)
for i in range(rin.shape[0]):
    rin[i] = i
rout = foo(rin, rcols)
print('cpu result'.center(80, '-'))
print(rout)
# python t3.py
-----------------------------------gpu result-----------------------------------
[[ 1. 0. -0.70710677 -0. ]
[ 1. 1. 0. -0.8164966 ]
[ 1. 2. 2.1213202 0.8164966 ]
[ 1. 3. 5.656854 7.3484693 ]
[ 1. 4. 10.606602 21.22891 ]]
-----------------------------------cpu result-----------------------------------
[[ 1. 0. -0.70710678 -0. ]
[ 1. 1. 0. -0.81649658]
[ 1. 2. 2.12132034 0.81649658]
[ 1. 3. 5.65685425 7.34846923]
[ 1. 4. 10.60660172 21.2289111 ]]
#
This answer has additional discussion on the concept of "slices" as they pertain to vectorize and guvectorize, and walks through a number of examples.
vectorize, guvectorize, and jit are three very different things and it isn't at all obvious which one of these you actually are trying to write (and saying "run on my GPU" is completely unhelpful. They all can run on the GPU in one form or another). Read numba.pydata.org/numba-doc/dev/user/vectorize.html – talonmies
vectorize adds another layer where numba takes care of the vectorization by itself, so I don't have to write loops over X. The difference between vectorize and guvectorize I don't know, except that guvectorize seems to be more flexible (believe me, the description in your link, which I already had seen, is not very helpful for beginners. I couldn't find more complete documentation) – Bananach
len(X) is normally much larger than N – Bananach