CUDA currently does not allow nested kernels.
To be specific, I have the following problem: I have N data points, each M-dimensional. To process each data point, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot write a single kernel that calls the three kernels. Therefore, I have to process the data points serially, one after another.
One solution is to write one big kernel containing the functionality of all three kernels, but I think it would be sub-optimal.
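For reference, a minimal sketch of what such a combined kernel might look like. The names stepA, stepB, stepC, fusedKernel and runFused are hypothetical, and the three steps are trivial element-wise placeholders rather than my real kernels. The sketch also assumes one thread block per data point, so that __syncthreads() is enough to separate the stages.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholders for the bodies of the three real kernels.
// Each thread of the block strides over the m elements of one data point.
__device__ void stepA(float *x, int m) {
    for (int i = threadIdx.x; i < m; i += blockDim.x) x[i] += 1.0f;
}
__device__ void stepB(float *x, int m) {
    for (int i = threadIdx.x; i < m; i += blockDim.x) x[i] *= 2.0f;
}
__device__ void stepC(float *x, int m) {
    for (int i = threadIdx.x; i < m; i += blockDim.x) x[i] -= 3.0f;
}

// One block per data point; the three stages run back to back inside the block.
__global__ void fusedKernel(float *data, int m) {
    float *point = data + (size_t)blockIdx.x * m;
    stepA(point, m);
    __syncthreads();  // stage A must finish for the whole block before stage B starts
    stepB(point, m);
    __syncthreads();
    stepC(point, m);
}

// Processes n zero-initialised points of dimension m.
// Returns the number of wrong elements, or -1 on a CUDA error.
int runFused(int n, int m) {
    float *d_data = nullptr;
    size_t bytes = (size_t)n * m * sizeof(float);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, bytes);  // every element starts at 0

    fusedKernel<<<n, 256>>>(d_data, m);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<float> h((size_t)n * m);
    if (err == cudaSuccess)
        err = cudaMemcpy(h.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : h) if (v != -1.0f) ++bad;  // (0 + 1) * 2 - 3 = -1
    return bad;
}

int main() {
    int bad = runFused(8, 1024);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

If a stage needs more threads than fit in one block, or needs synchronisation across blocks, this layout would no longer work as written.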
Can anyone suggest how streams can be used to process the N data points in parallel while retaining the three smaller kernels?
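To make the question concrete, here is a minimal sketch of the kind of stream usage I am asking about. It relies on kernels issued to the same stream running in issue order, while kernels in different streams are allowed to overlap. The names kernelA, kernelB, kernelC and runWithStreams are hypothetical, the kernel bodies are trivial element-wise placeholders, and the number of streams is an arbitrary choice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholders for the three real kernels (element-wise only).
__global__ void kernelA(float *x, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) x[i] += 1.0f;
}
__global__ void kernelB(float *x, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) x[i] *= 2.0f;
}
__global__ void kernelC(float *x, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) x[i] -= 3.0f;
}

// Runs A -> B -> C on each of the n points, spreading the points over numStreams streams.
// Returns the number of wrong elements, or -1 on a CUDA error.
int runWithStreams(int n, int m, int numStreams) {
    float *d_data = nullptr;
    size_t bytes = (size_t)n * m * sizeof(float);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, bytes);  // every element starts at 0
    cudaDeviceSynchronize();       // make sure the memset is done before any stream starts

    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    const int threads = 256;
    const int blocks = (m + threads - 1) / threads;
    for (int p = 0; p < n; ++p) {
        cudaStream_t s = streams[p % numStreams];  // round-robin assignment
        float *point = d_data + (size_t)p * m;
        // Same stream: the three kernels for this point run in issue order.
        kernelA<<<blocks, threads, 0, s>>>(point, m);
        kernelB<<<blocks, threads, 0, s>>>(point, m);
        kernelC<<<blocks, threads, 0, s>>>(point, m);
    }
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for all streams

    std::vector<float> h((size_t)n * m);
    if (err == cudaSuccess)
        err = cudaMemcpy(h.data(), d_data, bytes, cudaMemcpyDeviceToHost);

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : h) if (v != -1.0f) ++bad;  // (0 + 1) * 2 - 3 = -1
    return bad;
}

int main() {
    int bad = runWithStreams(8, 1024, 4);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

As far as I understand, the streams only permit overlap. Whether kernels from different data points actually run concurrently depends on the device and on how many resources each kernel uses.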
Thanks.