To hide the latency of launching CUDA kernels, is there support for launching a sequence of kernels without having to return to the CPU between launches, so that the whole sequence can be queued up and executed one after another on the GPU device? This seems important when dealing with larger kernels, where you might be hitting the instruction size limit and want to split the work into more modular kernels to reduce the overall instruction count (and where inlining might not be a good solution).
(In case it's important, I'm using JCuda; if that creates a limitation in achieving this functionality, please let me know.)
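To make the pattern concrete, here is a minimal sketch of what I'm after using JCuda's driver API. The kernel names `stageA`/`stageB`, the PTX file name, and the launch configuration are just placeholders for my real code; the point is the two back-to-back launches with only a single synchronization at the end:

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class KernelSequenceSketch {
    public static void main(String[] args) {
        // Standard JCuda driver-API setup (error handling omitted for brevity)
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Hypothetical module containing the two stages of the pipeline
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "kernels.ptx");
        CUfunction stageA = new CUfunction();
        CUfunction stageB = new CUfunction();
        cuModuleGetFunction(stageA, module, "stageA");
        cuModuleGetFunction(stageB, module, "stageB");

        // Placeholder device buffer shared by both stages
        int n = 1 << 20;
        CUdeviceptr data = new CUdeviceptr();
        cuMemAlloc(data, (long) n * Sizeof.FLOAT);
        Pointer params = Pointer.to(Pointer.to(data), Pointer.to(new int[]{ n }));

        // The pattern I want: enqueue both kernels back to back so the second
        // one starts as soon as the first finishes, without the CPU having to
        // wait in between the launches.
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(stageA, gridSize, 1, 1, blockSize, 1, 1, 0, null, params, null);
        cuLaunchKernel(stageB, gridSize, 1, 1, blockSize, 1, 1, 0, null, params, null);

        // Synchronize only once, after the whole sequence has been queued
        cuCtxSynchronize();

        cuMemFree(data);
    }
}
```

Is this the right way to think about it, or is there a more explicit mechanism for queuing a chain of kernels on the device?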