0
votes

To hide the latency of launching CUDA kernels, is there support for launching a sequence of kernels without having to return to the CPU between launches, so that the whole sequence can be dequeued and executed on the GPU device? This seems important when dealing with larger kernels, where you might be hitting the instruction size limit and want to introduce more modularity to reduce the overall instruction count (and where inlining might not be a good solution).

(In case it's important: I'm using JCuda. If this creates a limitation in achieving this functionality, please let me know.)

1

1 Answer

2
votes

What instruction size limit are you referring to? I'm not aware of one.

All CUDA kernel launches (<<<>>>, cuLaunch, etc.) are asynchronous: control returns to the CPU as soon as the CUDA driver has pushed the hardware commands, including the kernel launch, onto a command queue that the hardware dequeues from. Thus if you launch multiple CUDA kernels in sequence, with no intervening CPU work or blocking CUDA calls, the GPU executes them one after another without "returning to the CPU" between kernels.
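As a minimal sketch of what this looks like with the runtime API (the kernel names and sizes here are illustrative, not from your code): two kernels are launched back to back on the default stream, each launch returns immediately, and the GPU runs them in issue order with no CPU round trip in between.

```cuda
#include <cuda_runtime.h>

// Two illustrative stages of a pipeline, split into separate kernels.
__global__ void stage1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void stage2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Both launches return to the CPU immediately; the driver enqueues them
    // and the GPU executes them in issue order on the same (default) stream.
    stage1<<<grid, block>>>(d, n);
    stage2<<<grid, block>>>(d, n);  // begins on the GPU after stage1 finishes

    cudaDeviceSynchronize();        // block the CPU only when you need the results
    cudaFree(d);
    return 0;
}
```

JCuda exposes the same driver API (cuLaunchKernel), so launches made through it have the same asynchronous, in-order semantics per stream.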

See the CUDA Programming Guide for more detail.