
I'm solving the Biot-Savart law using CUDA, determining the velocity induced at NP points by NV vortex line elements (where NP ~ 2NV ~ 10^7).

By the nature of the problem, every vortex affects every point, so it makes sense to assign each point to a thread and have that thread accumulate the influence of all NV vortices on its point.

Despite apparently good device utilisation (NP >> number of processors), execution of a scenario in which CUDA parallelises the problem this way is pretty slow: the timings are similar to just running it on the CPU. My suspicion is that this is because the kernel is fairly heavy: each thread runs a for loop of ~10^7 iterations, and with NP ~ 10^7 and NV ~ 5x10^6 the full problem is of order NP x NV ~ 5x10^13 pair interactions.

I've given some consideration to dynamic parallelism, where a parent thread for each point would spawn many child threads (one per vortex) to replace the inner for loop. In other words, the child kernel would be elemental: it computes the interaction between a single vortex and a single point.

I'm no computer scientist: I was able to implement a basic CUDA approach (snippet 1 below), but I'm really struggling with the indexing required to even try out a dynamic approach for benchmarking. Please forgive any naivety here!

So I have two questions:

  1. Am I barking up the right tree? Do others feel this dynamic approach would be more efficient than just parallelising the outer loop (as in pseudocode snippet 1 below)?

  2. What is the best/correct indexing strategy? I just can't figure out how threadIdx behaves in the children. When I launch a kernel from a parent thread, do that thread's children get their own threadIdx, or does it depend on the parent's threadIdx in some way (in which case, how do I work out the child thread index)?

Examples in pseudocode here:

  1. Pre-dynamic CUDA Implementation

    __global__ void biot_predynamic_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
    {
        // Get current thread's index
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
    
        // Sizes and other inputs omitted for clarity
    
        // Counter up to the size of the input array NV
        int j;
    
        // Temporary storage for accumulating result
        vector3 temp; 
    
        if(idx < NP)
        {
            // initialise temp and answer[idx] to 0.0
    
            for(j=0; j<NV; j++)
            {
                // contribution of the current vortex element 
                temp = someInlineFcn(input1[j], input2[idx]);
    
                // accumulate contributions from each vortex element
                answer[idx] = temp + answer[idx];
            }
        }
    }
    
    
    int main()
    {
    
        // allocate arrays of vector3 type and do checks
        // read in data
        // Get number of output points and use that to set blockcount and threadsperblock
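        // e.g. (hypothetical values): threadsperblock = 256;
        //      blockcount = (NP + threadsperblock - 1)/threadsperblock;  // ceiling division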
    
        dim3 dimGrid(blockcount);
        dim3 dimBlock(threadsperblock);
    
        biot_predynamic_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);
    
        // Block until the device has completed (cudaThreadSynchronize is deprecated)
        cudaDeviceSynchronize();
    
        // Check for errors and write results
    }
    
  2. Dynamic CUDA Implementation - having difficulty figuring out indexing

    __global__ void biot_dynamic_child_kernel(vector3 *input1, vector3 point, vector3 *result)
    {
        // Get current thread's index within this child grid...???
        int vortexIdx = ??  // index of the current vortex in the input1 array

        // Temporary storage for this vortex's contribution
        vector3 temp;

        if(vortexIdx < NV)
        {
            // contribution of the current vortex element to the current point
            temp = someInlineFcn(input1[vortexIdx], point);

            // accumulate atomically so concurrent child threads don't overwrite each other;
            // atomicAdd has no vector3 overload, so add component by component
            atomicAdd(&result->x, temp.x);
            atomicAdd(&result->y, temp.y);
            atomicAdd(&result->z, temp.z);
        }
    }
    
    __global__ void biot_dynamic_parent_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
    {
        // Get current thread index
        int pointIdx = ???;
    
        if(pointIdx < NP)
        {
            // Initialise answer[pointIdx] to 0.0
    
            // parallelise the inner loop over all vortices; the atomic adds inside the
            // child kernel stop concurrent child threads overwriting answer[pointIdx]
            biot_dynamic_child_kernel<<<???,???>>>(input1, input2[pointIdx], &answer[pointIdx]);
    
            // do I need to sync here? If so, how do I sync only the children spawned
            // by this parent, as opposed to the children of all parents?
        }
    }
    
    
    int main()
    {
    
        // allocate arrays of vector3 type and do checks
        // read in data
        // Get number of output points and use that to set blockcount and threadsperblock ???
    
        dim3 dimGrid(blockcount); //????
        dim3 dimBlock(threadsperblock); //????
    
        biot_dynamic_parent_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);
    
        // Block until the device has completed (cudaThreadSynchronize is deprecated)
        cudaDeviceSynchronize();
    
        // Check for errors and write results
    }
    
CUDA dynamic parallelism is not first and foremost about performance. It's a feature designed to make it easier to realize programs in a programmer-friendly way, increase code re-use opportunities and programmer productivity, and perhaps improve code readability. But a simple re-factoring of an already solved problem using CDP is unlikely to yield significant performance benefit if the machine is already fully utilized in the first realization. Instead, as @JackOLantern is suggesting, you may want to revisit your basic algorithm. Furthermore, analysis-driven optimization may help. – Robert Crovella
@RobertCrovella CUDA dynamic parallelism does improve the performance of algorithms which do not fit a flat, single-level parallelism, such as interpolation. See this GTC2013 talk. – Vitality

1 Answer


Your problem seems to fall into the category of N-body problems, in which you have to calculate the effect of, say, N sources on N observation points. From your code snippet 1, it is apparent that you are using a "brute-force" approach, whose cost scales as O(N^2). Before trying dynamic parallelism, it would be useful to study tree-based approaches to the N-body problem, which reduce the computational complexity to O(N log N). Some pointers can be found at:

K-means clustering for optimal domain decomposition of parallel hierarchical N-body simulations

and at

Vortex filament method as a tool for computational visualization of quantum turbulence

The basic idea is to exploit the 1/r^2 decay of the effect of each vortex, calculating near-field interactions "exactly" and far-field interactions approximately. This is the same idea underlying the Multi-Level Fast Multipole Method (MLFMM) of electrodynamics. The CUDA toolkit contains an N-body sample, which is well explained in "The CUDA Handbook" by Nicholas Wilt. You may also wish, for your own knowledge, to have a look at the MLFMM.
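
That N-body sample is itself a tiled brute-force implementation, and the tiling idea applies directly to your snippet 1: instead of every thread streaming all NV vortices from global memory, each block stages one tile of vortices in shared memory at a time, and the per-thread sum is accumulated in a register rather than read-modify-written to answer[idx] on every iteration. A minimal sketch of that idea, reusing your (hypothetical) vector3 type and someInlineFcn, and assuming vector3 is a plain struct of three floats with an operator+ defined:

    #define TILE 256  // threads per block; also the number of vortices staged per step

    __global__ void biot_tiled_kernel(const vector3 *vortices, const vector3 *points,
                                      vector3 *answer, int NV, int NP)
    {
        __shared__ vector3 tile[TILE];

        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        vector3 acc = {0.0f, 0.0f, 0.0f};  // accumulate in a register, write once at the end
        vector3 p;
        if(idx < NP) p = points[idx];

        // sweep the vortex array one tile at a time (launch with blockDim.x == TILE)
        for(int base = 0; base < NV; base += TILE)
        {
            // each thread stages one vortex into shared memory
            if(base + threadIdx.x < NV) tile[threadIdx.x] = vortices[base + threadIdx.x];
            __syncthreads();

            // every thread in the block now reads the tile from fast shared memory
            int limit = min(TILE, NV - base);
            if(idx < NP)
                for(int t = 0; t < limit; t++)
                    acc = acc + someInlineFcn(tile[t], p);
            __syncthreads();  // all threads reach both barriers, so they are safe
        }

        if(idx < NP) answer[idx] = acc;
    }

Even with no algorithmic change, this kind of tiling is usually the first thing to try on an all-pairs kernel.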

As a final remark, let me mention that CUDA dynamic parallelism is mainly useful for improving performance in two cases:

  1. when the algorithm does not fit a flat, single-level parallelism; for example, interpolation fits a two-level parallelism (see this GTC2013 Conference Talk, and the sketch after this list);
  2. when the discretization of the computational grid is nonuniform over the computational domain and changes in time; the classical example is the turbulent flow reported in the CUDA dynamic parallelism whitepaper.
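
To make the two-level structure concrete (and to touch on your question 2): a child grid gets its own blockIdx and threadIdx, numbered from zero exactly as in a host-side launch; nothing is inherited from the parent, so the parent must pass its own identity (here, the point it owns) explicitly as an argument. A minimal sketch under your assumptions (vector3 as a plain struct of three floats, someInlineFcn as in the question; dynamic parallelism needs a compute capability 3.5+ device and nvcc -rdc=true):

    // child grid: its threads index the vortices, starting from zero for every launch
    __global__ void child_kernel(const vector3 *vortices, vector3 point,
                                 vector3 *result, int NV)
    {
        int vortexIdx = blockIdx.x*blockDim.x + threadIdx.x;  // local to THIS child grid
        if(vortexIdx < NV)
        {
            vector3 u = someInlineFcn(vortices[vortexIdx], point);
            atomicAdd(&result->x, u.x);  // no vector3 overload: add per component
            atomicAdd(&result->y, u.y);
            atomicAdd(&result->z, u.z);
        }
    }

    // parent grid: one thread per observation point
    __global__ void parent_kernel(const vector3 *vortices, const vector3 *points,
                                  vector3 *answer, int NV, int NP)
    {
        int pointIdx = blockIdx.x*blockDim.x + threadIdx.x;
        if(pointIdx < NP)
        {
            int threads = 256;
            int blocks  = (NV + threads - 1)/threads;
            child_kernel<<<blocks, threads>>>(vortices, points[pointIdx],
                                              &answer[pointIdx], NV);
            // no explicit sync is needed just for correctness: the parent grid is
            // not considered complete until all of its child grids have completed
        }
    }

Note, however, that with NP ~ 10^7 parent threads this creates millions of child grids, whose launch overhead will most likely swamp any benefit; this is exactly why the flat kernel (or its tiled variant above) tends to remain the better brute-force baseline.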

Good discussions of the performance improvements achievable with CUDA dynamic parallelism can be found in the above-quoted CUDA Handbook and in the second edition of Programming Massively Parallel Processors: A Hands-On Approach by David B. Kirk and Wen-mei W. Hwu.