3
votes

CUDA currently does not allow nested kernels.

To be specific, my problem is the following: I have N M-dimensional data points. To process each of the N data points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot write one kernel that calls the three kernels. Therefore, I have to process each data point serially.

One solution is to write a single big kernel containing the functionality of all three kernels, but I suspect it will be sub-optimal.

Can anyone suggest how streams can be used to process the N data points in parallel, while retaining the three smaller kernels?

Thanks.

3
What's wrong with a big kernel? – Anycorn
I cannot achieve fine-grained parallelism. Say I am doing three different matrix operations on one data point; I can write a kernel for each of them. Suppose one of the kernels is a matrix multiplication, C = A*B. The multiplication kernel computes each entry C(i,j) in parallel, which I cannot do when I have one big kernel containing all three operations. All that big kernel would be doing is working on the data points in parallel. – Prasanna
You can certainly run multiple streams. It is fairly straightforward: the 4th parameter to a kernel launch is the stream. Kernels launched on the same stream execute sequentially, but kernels launched on different streams execute in no synchronized order. If you have a specific question about the implementation, I could help you with that. – Anycorn
Thanks, aaa. Can you please refer me to some example of using the fourth parameter in a kernel launch? – Prasanna

3 Answers

3
votes

Well, if you want to use streams... you will want to create N streams:

cudaStream_t *streams;
streams = (cudaStream_t *) malloc(N * sizeof(cudaStream_t));
for(i=0; i<N; i++)
{
    cudaStreamCreate(&streams[i]);
}

Then for the ith data point, you want to use cudaMemcpyAsync for transfers:

cudaMemcpyAsync(dst, src, count, kind, streams[i]);

and call your kernels with all four configuration parameters (sharedMemory can be 0, of course):

kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );

and of course cleanup:

for(i=0; i<N; i++)
{
    cudaStreamDestroy(streams[i]);
}
free(streams);
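Putting the pieces together, here is a minimal sketch of the whole per-point pipeline. The kernel names, `d_data`/`h_data` buffers, and launch sizes are placeholders for your own code, and error checking is omitted for brevity:

```cuda
// Hypothetical sketch: one stream per data point, three kernels in sequence.
for (int i = 0; i < N; i++)
{
    // Async copy of the ith point's input to the device on its own stream.
    cudaMemcpyAsync(d_data + i * M, h_data + i * M,
                    M * sizeof(float), cudaMemcpyHostToDevice, streams[i]);

    // The three kernels run in order within streams[i], but work in
    // different streams may overlap on the GPU.
    kernel_1<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);
    kernel_2<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);
    kernel_3<<<nBlocks, nThreads, 0, streams[i]>>>(d_data + i * M, M);

    // Copy the result back on the same stream.
    cudaMemcpyAsync(h_data + i * M, d_data + i * M,
                    M * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish
```

Note that for the host-to-device copies to actually overlap, `h_data` should be pinned memory allocated with `cudaMallocHost`.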

2
votes

As an update to the accepted answer: NVIDIA GPUs with Compute Capability 3.5 and higher now allow nested kernels, a feature NVIDIA calls Dynamic Parallelism.
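For illustration, a minimal Dynamic Parallelism sketch might look like the following (the kernel names and the doubling operation are placeholders; this requires compiling with `nvcc -arch=sm_35 -rdc=true`):

```cuda
// A parent kernel launching a child kernel from the device.
__global__ void child_kernel(float *data, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < M)
        data[j] *= 2.0f;  // placeholder per-element work
}

__global__ void parent_kernel(float *data, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        // Each parent thread launches a child grid for its own data point.
        child_kernel<<<(M + 255) / 256, 256>>>(data + i * M, M);
    }
}
```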

0
votes

Nowadays, on the Fermi architecture (Compute Capability 2.0 and higher), it is possible to launch kernels concurrently by placing them in different streams.
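A minimal sketch of concurrent kernel launches on Fermi-class hardware (kernel names and arguments are placeholders; the launches may overlap only if enough GPU resources are free):

```cuda
// Two kernels launched into different streams may execute concurrently.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernel_a<<<nBlocks, nThreads, 0, s1>>>(d_a);  // these two launches
kernel_b<<<nBlocks, nThreads, 0, s2>>>(d_b);  // can overlap on the GPU

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```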