Benefit of splitting a big CUDA kernel and using dynamic parallelism

Question

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel, in this loop a certain predicate is evaluated on the current state and on the result of this predicate, a certain action is taken.

The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very very low.

I would like to split the kernel in many little kernels, but every block is totally independent from the others and I (think I) can't use a single thread on the host code to launch multiple small kernels.

I am not sure if streams are adequate for this kind of work, I never used them, but since I have the option to use the dynamic parallelism, I would like if that is a good option to implement this kind of job. Is it fast to launch a kernel from a kernel? Do I need to copy data in global memory to make them available to a sub-kernel?

If I split my big kernel in many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which allows me to move temporary variables in every sub-kernel), will help me increase the occupancy?

I know it is a bit generic question, but I do not know this technology and I would like if it fits my case or if streams are better.

EDIT: To provide some other details, you can imagine my kernel to have this kind of structure:

__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;
    int tid = ...;
    systemState[tid] = initialData[tid];

    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;
        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;
        sample(sampleData, systemState);
        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }
    do_some_stats();
}

this is to show you that there is a main loop, that there are temporary data which are used somewhere and not in other points, that there are shared data, synchronization points, etc.

Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not true, but logically it is)... One "big flow" for each block.

Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host I guess... But how do I coordinate, from a single loop, all the blocks? This is what leaves me most dubious. May I use streams from different host threads (One thread per block)?

I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure if I could have advantages here.

I'd say streams are a good fit for your problem. But if you are keen on trying dynamic parallelism, go ahead. I'm still looking to see an application where they actually provide a speedup. — tera
Thanks, @tera. Do you have a good resource about using streams in a situation like this? I can't figure much out of the C Programming guide, and I am still unsure about how to use them. I will now add some other info about how my kernel is structured in the question. — AkiRoss

Vitality Vitality · Accepted Answer · 2013-06-17T13:36:46

I have benefitted from dynamic parallelism for solving an interpolation problem of the form:

int i = threadIdx.x + blockDim.x * blockIdx.x;

for(int m=0; m<(2*K+1); m++) {

    PP1 = calculate_PP1(i,m);
    phi_cap1 = calculate_phi_cap1(i,m);  

        for(int n=0; n<(2*K+1); n++) {

            PP2 = calculate_PP2(i,m);
            phi_cap2 = calculate_phi_cap2(i,n);

            atomicAdd(&result[PP1][PP2],data[i]*phi_cap1*phi_cap2); } } }

where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them in a (2K+1)x(2K+1) kernel.

From my (possibly incomplete) experience, dynamic parallelism will help if you have a few number of independent iterations. For larger number of iterations, perhaps you could end up by calling the child kernel several times and so you should check if the overhead in kernel launch will be the limiting factor.

Benefit of splitting a big CUDA kernel and using dynamic parallelism

1 Answers