3
votes

CUDA currently does not allow nested kernels.

To be specific, my problem is the following: I have N M-dimensional data points. To process each of the N data points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot write one kernel that calls the three kernels. Therefore, I have to process each data point serially.

One solution is to write a single big kernel containing the functionality of all three kernels, but I suspect it will be sub-optimal.

Can anyone suggest how streams can be used to process the N data points in parallel, while retaining the three smaller kernels?

Thanks.

3
What's wrong with a big kernel? – Anycorn
I cannot achieve fine-grained parallelism. Say I am doing three different matrix operations on one data point; I can write a kernel for each of them. Suppose one of the kernels is a matrix multiplication, C = A*B. The multiplication kernel computes each entry C(i,j) in parallel, which I cannot do when I have one big kernel containing all three operations. All that big kernel would be doing is working on the data points in parallel. – Prasanna
You can certainly run multiple streams. It is fairly straightforward: the 4th parameter to a kernel launch is the stream. Kernels launched on the same stream execute sequentially, but kernels launched on different streams execute in no synchronized order. If you have a specific question about the implementation, I could help you with that. – Anycorn
Thanks, aaa. Can you please refer me to some example of using the fourth parameter in a kernel launch? – Prasanna

3 Answers

3
votes

Well, if you want to use streams... you will want to create N streams:

cudaStream_t *streams;
streams = (cudaStream_t *) malloc(N * sizeof(cudaStream_t));
for(i=0; i<N; i++)
{
    cudaStreamCreate(&streams[i]);
}

Then for the ith data point, you want to use cudaMemcpyAsync for transfers:

cudaMemcpyAsync(dst, src, count, kind, streams[i]);

and call your kernels with all four configuration parameters (sharedMemory can be 0, of course):

kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );

and of course cleanup:

for(i=0; i<N; i++)
{
    cudaStreamDestroy(streams[i]);
}
free(streams);
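Putting the pieces together, here is a minimal sketch of the whole per-point pipeline. The kernel names, `d_data`/`h_data` buffers, and launch sizes are placeholders for your own code, and error checking is omitted for brevity:

```cuda
// Hypothetical sketch: one stream per data point, three kernels in sequence.
for (int i = 0; i < N; i++)
{
    // Async copy of the ith point's input to the device on its own stream.
    cudaMemcpyAsync(d_data + i * M, h_data + i * M,
                    M * sizeof(float), cudaMemcpyHostToDevice, streams[i]);

    // The three kernels run in order within streams[i], but work in
    // different streams may overlap on the GPU.
    kernel_1<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);
    kernel_2<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);
    kernel_3<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);

    // Copy the result back on the same stream.
    cudaMemcpyAsync(h_data + i * M, d_data + i * M,
                    M * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish
```

Note that for the host-to-device copies to actually overlap, `h_data` should be pinned memory allocated with `cudaMallocHost`.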

2
votes

As an update to the accepted answer: NVIDIA GPUs with Compute Capability 3.5 and higher now allow nested kernels, a feature NVIDIA calls Dynamic Parallelism.
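For illustration, a minimal Dynamic Parallelism sketch might look like the following (the kernel names and the doubling operation are placeholders; this requires compiling with `nvcc -arch=sm_35 -rdc=true`):

```cuda
// A parent kernel launching a child kernel from the device.
__global__ void child_kernel(float *data, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < M)
        data[j] *= 2.0f;  // placeholder per-element work
}

__global__ void parent_kernel(float *data, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        // Each parent thread launches a child grid for its own data point.
        child_kernel<<<(M + 255) / 256, 256>>>(data + i * M, M);
    }
}
```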

0
votes

Nowadays, on the Fermi architecture (Compute Capability 2.0 and higher), it is possible to launch kernels concurrently by placing them in different streams.
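A minimal sketch of concurrent kernel launches on Fermi-class hardware (kernel names and arguments are placeholders; the launches may overlap only if enough GPU resources are free):

```cuda
// Two kernels launched into different streams may execute concurrently.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernel_a<<<nBlocks, nThreads, 0, s1>>>(d_a);  // these two launches
kernel_b<<<nBlocks, nThreads, 0, s2>>>(d_b);  // can overlap on the GPU

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```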