In the CUDA 5 programming guide, the following is said:
"Launches may continue to a depth of 24 generations, but this depth will typically be limited by available resources on the GPU."
My questions are the following:
Does the CUDA runtime on the GPU guarantee that a depth of 24 can always be achieved, and might it in some circumstances even go beyond 24 (case A)? Or do they mean that 24 is the absolute maximum limit and that this number might in fact not be reached at runtime (case B)?
If case B, what happens when a kernel is launched on the GPU and there are not enough resources? Does the launch fail? (Weird if this is the case!)
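For what it's worth, here is a minimal sketch (not anyone's production code) of how a kernel using dynamic parallelism can check whether its own device-side child launch succeeded, since the device runtime exposes cudaGetLastError / cudaGetErrorString. It assumes compilation with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt:

```
#include <cstdio>

#define MAX_DEPTH 24   // the documented generation limit

// Each generation launches one child until MAX_DEPTH, then checks
// whether the device-side launch itself reported an error.
__global__ void recurse(int depth)
{
    printf("generation %d\n", depth);
    if (depth + 1 >= MAX_DEPTH) return;

    recurse<<<1, 1>>>(depth + 1);           // nested (device-side) launch
    cudaError_t err = cudaGetLastError();   // device-side error check
    if (err != cudaSuccess)
        printf("launch failed at depth %d: %s\n",
               depth, cudaGetErrorString(err));
}

int main()
{
    recurse<<<1, 1>>>(0);        // the host launch is generation 0
    cudaDeviceSynchronize();     // wait for the whole launch tree
    return 0;
}
```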
I plan on writing a CUDA program and I would like to benefit from the Kepler architecture. My algorithm absolutely needs function recursion, typically to a depth of 15-19 (the recursion depth is bound to my data structures).
You can make your recursive function a __device__ function and use it recursively. You just won't have kernels launching at each round of recursion in that case. I tried a simple recursive factorial implementation in CUDA and was able to recurse up to a depth of 20, the limit of long int to store the result. Needs cc 2.0. - Robert Crovella
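A minimal sketch of the device-function recursion that comment describes (plain call-stack recursion inside one kernel, no dynamic parallelism; the names here are illustrative). Recursive __device__ functions need compute capability 2.0 or later, and long long is used instead of long int so the 64-bit width is explicit:

```
#include <cstdio>

// Ordinary call-stack recursion inside a single kernel; no child launches.
__device__ long long factorial(int n)
{
    return (n <= 1) ? 1LL : n * factorial(n - 1);
}

__global__ void fact_kernel(int n)
{
    printf("%d! = %lld\n", n, factorial(n));
}

int main()
{
    fact_kernel<<<1, 1>>>(20);   // 20! is the largest factorial a 64-bit int holds
    cudaDeviceSynchronize();
    return 0;
}
```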