3
votes

I have a function with the following structure:

import numpy as np
import numba

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((len(X),N))
    out[:,0]  = X 
    for i in range(1,N):
        out[:,i] = X**i+out[:,i-1] 
    return out

which I am now trying to run on my GPU. What I tried so far is to write the function in a non-vectorized form (i.e. treat each entry of X separately), and pass the return array as an input:

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

However, I don't know what decorator to use for that function. If I use

  1. @numba.vectorize([(float64,int64,float64[:])],target = 'cuda') I get TypeError: Buffer dtype cannot be buffer
  2. @numba.guvectorize([(float64,int64,float64[:])],'(),()->(n)',target = 'cuda') I get NameError: undefined output symbols: n

What is the correct decorator to use for my purpose?

I would like to be able to call foo_cuda in roughly the same way as foo at the end, i.e. pass a 1D array X, an integer N and a 2D array out that gets filled with the results.

UPDATE

The numpy.vectorize version of my function would be

def foo_np(x,N):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out = np.zeros(N)
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]
    return out
foo_ve = np.vectorize(foo_np,signature='(),()->(n)')

However, I cannot create an output array (out = np.zeros(N)) in numba (cuda.local.array(N,dtype=float64) fails as well), which precludes me from using @numba.vectorize('void(float64,int64)',target='cuda'). I tried to fix this by passing the output array to the function and adding it to the signature (see attempt 1. above), but I got an error.

UPDATE 2

The actual function is as follows:

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2] 
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out
What is your "purpose"? vectorize, guvectorize, and jit are three very different things and it isn't at all obvious which one of these you actually are trying to write (and saying "run on my GPU") is completely unhelpful. They all can run on the GPU in one form or another). Read numba.pydata.org/numba-doc/dev/user/vectorize.htmltalonmies
@talonmies All I know is that I want to accelerate the code. How to achieve that best, I don't know. If all options were working, I could try them out, but I can't get any working. From my understanding, vectorize adds another layer where numba takes care of the vectorization by itself, so I don't have to write loops over X. The difference between vectorize and guvectorize I don't know, except that guvectorize seems to be more flexible (believe me, the description in your link, which I had already seen, is not very helpful for beginners; I couldn't find more complete documentation). – Bananach
@talonmies Since it probably matters, len(X) is normally much larger than N. – Bananach
Your understanding of what vectorize does is incorrect. All of these concepts are more or less identical to those in plain numpy. Perhaps this helps: stackoverflow.com/questions/3379301/… . Otherwise there is no answer to your question, I'm afraid. You seem to be trying to ask "What decorator will deliver my free lunch?". As with all things in life, there is no free lunch. – talonmies
numba cuda vectorize and guvectorize are basically expecting to do elementwise operations, with no interdependencies between computed results. A prefix sum doesn't really fit that template. Although guvectorize generalizes this a bit, it still doesn't provide a mechanism to exchange intermediate computed results between threads, which is useful for things like parallel reduction and prefix sum. This sort of cooperative thread work can be set up using numba cuda python (see the previously linked example for parallel reduction), and I believe it could be used to create a prefix sum. – Robert Crovella

1 Answer

7
votes

I dug into this a bit, so I'll share what I have. Whether or not it is a complete answer I'm not sure, but it may address some of your questions.

Probably the best numba-based approach for this is to write your own "custom" CUDA kernel using numba CUDA (jit). An example of this is here for reduction or here for matrix multiply. To do this correctly would require learning something about CUDA programming. This didn't seem to be the direction you wanted to go in however.

As an alternative, your question focuses on using numba vectorization to generate GPU code. The numba vectorize decorator is used for functions that operate on scalar input and output, and the vectorization applies them to matrix input/matrix output.

For functions that don't fit this, e.g. those that operate on a vector or a scalar but produce a vector, or those that operate on one or two vectors and produce a vector or scalar output, numba provides the generalized guvectorize.
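As a rough CPU-target illustration of the difference (the toy functions square and running_sum below are just for demonstration, not part of your problem):

import numpy as np
from numba import vectorize, guvectorize

# vectorize: the decorated function sees one scalar at a time and returns a scalar
@vectorize(['float32(float32)'])
def square(x):
    return x * x

# guvectorize: each call receives a whole 1D "slice" and fills a 1D output slice
@guvectorize(['void(float32[:], float32[:])'], '(n)->(n)')
def running_sum(inp, out):
    acc = 0.
    for i in range(inp.shape[0]):
        acc += inp[i]
        out[i] = acc

a = np.arange(6, dtype=np.float32).reshape(2, 3)
print(square(a))       # elementwise: one function call per element
print(running_sum(a))  # row-wise: one function call per row of a

Both decorators also accept target='cuda' to run on the GPU, as in the examples below.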

Starting with your simplest example of what you want to do, we could realize this with guvectorize:

What I tried so far is to write the function in a non-vectorized form (i.e. treat each entry of X separately), and pass the return array as an input:

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

This intent, to take a scalar (per function call) and return a vector (per function call), falls within the capability of guvectorize (with some limitations/caveats - see notes at bottom).

Here is a worked example, derived from the sample code here:

# cat t2.py
from __future__ import print_function

import sys

import numpy as np

from numba import guvectorize, cuda

if sys.version_info[0] == 2:
    range = xrange


#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        tmp1 = 0.
        tmp = inp[0]
        for i in range(out.shape[0]):
            tmp1 += tmp
            out[i] = tmp1
            tmp *= inp[0]

# set up input data
rows = 1280000 # shape[0]
cols = 4   # shape[1]
inp = np.zeros(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,0] = (i%4)+1
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('result'.center(80, '-'))
print(inp)

# nvprof --print-gpu-trace python t2.py
==4773== NVPROF is profiling process 4773, command: python t2.py
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4773== Profiling application: python t2.py
==4773== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
994.08ms  5.5731ms                    -               -         -         -         -  19.531MB  3.4224GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
1.00083s  159.20us          (20000 1 1)        (64 1 1)        22        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>) [48]
1.00100s  4.8017ms                    -               -         -         -         -  19.531MB  3.9722GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

# nvprof --metrics gst_efficiency python t2.py
==4787== NVPROF is profiling process 4787, command: python t2.py
==4787== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)" (done)
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4787== Profiling application: python t2.py
==4787== Profiling result:
==4787== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
#

To respond to your specific question in the context of guvectorize:

What is the correct decorator to use for my purpose?

The function type specification looks like this:

['<return-type>(<parameter 0 type>, <parameter 1 type>, ...)']

For guvectorize, the return type should always be void. The types of the parameters should match the types you intend the function to be used with. With guvectorize we think about types as a "slice" of the actual input and output data types we will pass, the "slice" being what an individual function call will operate on. The vectorization then applies individual function calls to each "slice" of the input/output data, to cover the entire size of the input/output data set. For my example, then, I am proposing to pass an input vector (float32[:]) and an output vector (float32[:]).

The function signature shows the dimensions of the input, followed by the dimensions of the output:

(x)...->(x)

Each can be multidimensional (although should still represent only a "slice" of the input/output, for vectorization), and scalar can be represented by (). A wrinkle arises here, since we want the output "slice" to be a vector of results, of some length, say n. numba guvectorize does not seem to allow the specification of an output dimension (n) that is not part of an already specified input dimension. So while this function only really needs a scalar input, I have chosen to work around this by passing a vector "slice" for input and output. In fact this function, the way I have written it, can use the same data for input and output, so there isn't really "overhead" to this workaround, for this particular case.
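As a cross-check (a minimal sketch of my own, not needed for the GPU path), the same '(n)->(n)' slicing semantics can be reproduced with plain np.vectorize, which is handy for validating what each guvectorize call should produce:

import numpy as np

# reference with the same '(n)->(n)' slicing: each call gets one row, reads the
# scalar from inp[0], and returns that row's partial sums of powers of inp[0]
def ref_func(inp):
    out = np.empty_like(inp)
    tmp1 = 0.
    tmp = inp[0]
    for i in range(out.shape[0]):
        tmp1 += tmp
        out[i] = tmp1
        tmp *= inp[0]
    return out

ref = np.vectorize(ref_func, signature='(n)->(n)')

a = np.array([[1., 0., 0., 0.],
              [2., 0., 0., 0.],
              [3., 0., 0., 0.]], dtype=np.float32)
print(ref(a))   # rows: [1, 2, 3, 4], [2, 6, 14, 30], [3, 12, 39, 120]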

Some notes about implementation/performance:

  1. This "parallelizes" across the first array (shape[0]) dimension. All of the computations are being performed on the GPU. The computations to perform each vector slice of the output are being performed by a single thread on the GPU. But for a large data set (first dimension) this will expose plenty of parallel work (threads) for the GPU. The work in the second dimension, i.e. the loop, is operating in the context of each thread. Attempting to parallelize in the second dimension (creating e.g. a prefix sum) would almost certainly not be possible using vectorize or guvectorize but should be possible using numba cuda (jit). We can confirm the parallelization dimension by studying the (first) nvprof --print-gpu-trace output in the above worked example, and noting that it constitutes 20000 blocks of 64 threads each, for a total of 1280000 threads, matching our first array dimension.

  2. As mentioned already, this implementation is somewhat hacky as I am passing a vector input even though I only need and use a scalar. This is to work around what seems to be a limitation in numba that you cannot specify a dimension signature like ()->(n) as far as I can tell. (Note: after further study, I think the right thing to do here is to define an input-only ufunc, passing the input vector/matrix as just a single argument instead of twice, and using just (n) as the dimension signature, rather than (n)->(n). See here.)

  3. This implementation is not optimal from a memory access pattern point of view. This is evident from the (second) nvprof --metrics gst_efficiency output. I think the reason for this is also possibly a limitation in numba design (currently). When numba is presented with an array in this example for distribution/parallelization via guvectorize, it slices the array into rows, and passes each row to a function call to process. For this particular example, a much more efficient implementation would be to transpose our array storage, have the array sliced into columns, and let each function call work on a column. The reason for this has to do with GPU behavior and design, which I won't review here. However, the metric shows that we're only achieving 25% efficiency with this row-per-thread access pattern. It would be easy to achieve 100% efficiency with transposed data and a column-per-thread approach, but I don't know how to achieve this with numba guvectorize. The only option I know of to resolve the inefficient access would be to revert to writing a numba CUDA kernel (i.e. CUDA jit); a rough sketch of that is shown right after these notes. (Another alternative I tried to investigate was to see if we could instruct the ufunc to take a column-vector in its signature, e.g. something like void(float32[[:]],float32[[:]]), hoping that numba would slice things by columns, but had no luck with that.)
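To give a concrete idea of that last option, here is a rough sketch (my own; the names my_func_kernel and outT are illustrative) of a numba CUDA kernel doing the same computation as t2.py, but with the output stored transposed, shape (N, len(x)), so that the threads of a warp store to adjacent addresses:

import numpy as np
from numba import cuda

@cuda.jit
def my_func_kernel(x, outT):
    # each thread reads one scalar x[j] and fills column j of the transposed
    # output buffer outT (shape N x len(x))
    j = cuda.grid(1)                    # one thread per element of x
    if j < x.shape[0]:
        tmp1 = 0.
        tmp = x[j]
        for i in range(outT.shape[0]):
            tmp1 += tmp
            outT[i, j] = tmp1           # adjacent threads -> adjacent addresses
            tmp *= x[j]

rows = 1280000                          # len(x), as in t2.py
cols = 4                                # N
x = ((np.arange(rows) % 4) + 1).astype(np.float32)
d_x = cuda.to_device(x)
d_outT = cuda.device_array((cols, rows), dtype=np.float32)
threads = 64
blocks = (rows + threads - 1) // threads   # 20000 blocks, matching the gufunc launch
my_func_kernel[blocks, threads](d_x, d_outT)
out = d_outT.copy_to_host().T           # .T is just a view; values match t2.py

With this layout the global stores should be coalesced, which is exactly what the 25% gst_efficiency figure above was flagging for the row-per-thread pattern.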

Using the first worked example (t2.py) as a model, we can create something similar that solves your "actual function" in Update 2, using guvectorize. I've reduced the array sizes here to 5 (= len(X)) by 4 (= N):

# cat t3.py
from __future__ import print_function

import sys
import numpy as np
import numba

from numba import guvectorize, cuda
import math

if sys.version_info[0] == 2:
    range = xrange

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        for i in range(2,out.shape[0]):
            out[i] = out[1]*out[i-1] - (i-1)*out[i-2]
        c = 1.
        for i in range(2,out.shape[0]):
            c *= i
            out[i] /= math.sqrt(c)

# set up input data
rows = 5   # shape[0]
cols = 4   # shape[1]
inp = np.ones(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,1] = i
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('gpu result'.center(80, '-'))
print(inp)

rrows = rows
rcols = cols
rin = np.zeros(rrows, dtype=np.float32)
for i in range(rin.shape[0]):
    rin[i] = i
rout = foo(rin, rcols)
print('cpu result'.center(80, '-'))
print(rout)


# python t3.py
-----------------------------------gpu result-----------------------------------
[[ 1.          0.         -0.70710677 -0.        ]
 [ 1.          1.          0.         -0.8164966 ]
 [ 1.          2.          2.1213202   0.8164966 ]
 [ 1.          3.          5.656854    7.3484693 ]
 [ 1.          4.         10.606602   21.22891   ]]
-----------------------------------cpu result-----------------------------------
[[ 1.          0.         -0.70710678 -0.        ]
 [ 1.          1.          0.         -0.81649658]
 [ 1.          2.          2.12132034  0.81649658]
 [ 1.          3.          5.65685425  7.34846923]
 [ 1.          4.         10.60660172 21.2289111 ]]
#

This answer has additional discussion on the concept of "slices" as they pertain to vectorize and guvectorize, and walks through a number of examples.