Priority of concurrent CUDA kernel execution

Question

I have two kernels (A and B) that can be executed concurrently. I need kernel A to finish as soon as possible, so that I can start an MPI exchange of its result. One option is to execute them in a single stream: A, then B.

However, kernel A has few thread blocks, so if I run A and B sequentially, the GPU is not fully utilized while A is running.

Is it possible to execute A and B concurrently with A having higher priority?

That is, I want thread blocks from kernel B to start executing only when kernel A has no blocks left waiting to start.

As I understand it, if I launch kernel A in one stream and then, on the next line of host code, launch kernel B in another stream, there is no guarantee that thread blocks from B will not be executed first. Is that correct?

Comment (Greg Smith): The CUDA programming model in version 5.0 and earlier does not support stream priorities. The current CUDA 5.0 driver on compute capability 2.0 - 3.0 devices dispatches commands to the GPU in the order in which the work was submitted on the CPU. This may not be true in future drivers. Compute capability 3.5 devices may execute the work out of order.

solvingPuzzles · Accepted Answer · 2013-10-20T02:42:10

NVIDIA now provides a way to prioritize CUDA kernels. This is a fairly new feature, so you'll need CUDA 5.5 or later for it to work.

In your case, you would launch kernel A in a high-priority CUDA stream and kernel B in a low-priority CUDA stream. The function you'll probably want is cudaStreamCreateWithPriority(..., priority).

To use this functionality, you'll need a GPU with Compute Capability 3.5 or higher. To check whether priorities are supported on your GPU, look at cudaDeviceProp::streamPrioritiesSupported. cudaDeviceGetStreamPriorityRange reports the range of priority values that are meaningful on your GPU, from which you can work out how many levels are available. Its syntax is a bit wonky: it returns the least and the greatest priority through two pointer arguments, and lower numbers mean higher priority. It's worth looking in the CUDA manual to see how this works.
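The support check and the range query described above might look roughly like this. It is a minimal sketch rather than a definitive implementation; it assumes device 0 is the GPU of interest, and the variable names are only illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    // Check whether this device supports stream priorities at all.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("streamPrioritiesSupported = %d\n", prop.streamPrioritiesSupported);

    // Both values come back through pointer arguments.
    // Lower numbers mean higher priority, so greatestPriority is
    // numerically less than or equal to leastPriority.
    int leastPriority = 0;
    int greatestPriority = 0;
    err = cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetStreamPriorityRange: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Number of distinct levels in the inclusive range.
    const int levels = leastPriority - greatestPriority + 1;
    printf("least = %d, greatest = %d, levels = %d\n",
           leastPriority, greatestPriority, levels);
    return 0;
}
```

If the device does not support priorities, the query is expected to return the same value for both bounds, which gives a single level.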

More detailed documentation on priority settings from the CUDA Runtime API manual:

cudaError_t cudaStreamCreateWithPriority(cudaStream_t *pStream, 
                                         unsigned int flags, int priority)
Create an asynchronous stream with the specified priority.

Parameters
pStream  = Pointer to new stream identifier 
flags    = Flags for stream creation. See cudaStreamCreateWithFlags for a list of 
           valid flags that can be passed 
priority = Priority of the stream. Lower numbers represent higher priorities. See  
           cudaDeviceGetStreamPriorityRange for more information about the 
           meaningful stream priorities that can be passed.
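Putting this together for the question's two kernels, a sketch could look like the following. The kernel bodies, problem sizes, and the CHECK macro are hypothetical stand-ins, not the asker's code, and the MPI exchange is only indicated by a comment. As far as I know, priorities are a scheduling hint that affects which pending thread blocks are picked next; they are not a guarantee of execution order, and blocks that are already running are not preempted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in for kernel A: few blocks, result needed early.
__global__ void kernelA(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

// Hypothetical stand-in for kernel B: many blocks, can fill idle SMs.
__global__ void kernelB(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = sqrtf(static_cast<float>(i));
}

int main() {
    // Query the valid range; lower number = higher priority.
    int leastPriority = 0, greatestPriority = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority));

    cudaStream_t highStream, lowStream;
    CHECK(cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking,
                                       greatestPriority));
    CHECK(cudaStreamCreateWithPriority(&lowStream, cudaStreamNonBlocking,
                                       leastPriority));

    const int threads = 256;
    const int nA = 4 * threads;  // assumption: A is small
    const int nB = 1 << 20;      // assumption: B is large
    float *dA = nullptr, *dB = nullptr;
    CHECK(cudaMalloc(&dA, nA * sizeof(float)));
    CHECK(cudaMalloc(&dB, nB * sizeof(float)));

    // Launch A in the high-priority stream and B in the low-priority one.
    kernelA<<<(nA + threads - 1) / threads, threads, 0, highStream>>>(dA, nA);
    CHECK(cudaGetLastError());
    kernelB<<<(nB + threads - 1) / threads, threads, 0, lowStream>>>(dB, nB);
    CHECK(cudaGetLastError());

    // Wait only for A; its result is ready here, B may still be running.
    CHECK(cudaStreamSynchronize(highStream));
    // ... the MPI exchange of A's result would go here (not shown) ...

    CHECK(cudaStreamSynchronize(lowStream));

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaStreamDestroy(highStream));
    CHECK(cudaStreamDestroy(lowStream));
    return 0;
}
```

Synchronizing on the high-priority stream alone lets the host proceed as soon as A is done, without waiting for B.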
