
I'm solving the Biot-Savart law using CUDA, determining the velocity induced at NP points by NV vortex line elements (where NP ~ 2NV ~ 10^7).

By the nature of the problem, every vortex affects every point, so it makes sense to assign each point to a thread and have that thread accumulate the influence of all NV vortices on its point.

Despite apparently good device utilisation (NP >> number of processors), execution of a scenario in which CUDA parallelises the problem this way is pretty slow: the timings are similar to just running it on the CPU. My suspicion is that this is because the kernel is fairly heavy: each thread runs a for loop of ~10^7 iterations, and with NP ~ 10^7 and NV ~ 5x10^6 the full problem is of order NP x NV ~ 5x10^13 pair interactions.

I've given some consideration to dynamic parallelism, where a parent thread for each point would spawn many child threads (one per vortex) to replace the inner for loop. In other words, the child kernel would be elemental: it computes the interaction between a single vortex and a single point.

I'm no computer scientist: I was able to implement a basic CUDA approach (snippet 1 below), but I'm really struggling with the indexing required to even try out a dynamic approach for benchmarking. Please forgive any naivety here!

So I have two questions:

  1. Am I barking up the right tree? Do others feel this dynamic approach would be more efficient than just parallelising the outer loop (as in pseudocode snippet 1 below)?

  2. What is the best/correct indexing strategy? I just can't figure out how threadIdx behaves in the children. When I launch a kernel from a parent thread, do that thread's children get their own threadIdx, or does it depend on the parent's threadIdx in some way (in which case, how do I work out the child thread index)?

Examples in pseudocode here:

  1. Pre-dynamic CUDA Implementation

    __global__ void biot_predynamic_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
    {
        // Get current thread's index
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
    
        // Sizes and other inputs omitted for clarity
    
        // Counter up to the size of the input array NV
        int j;
    
        // Temporary storage for accumulating result
        vector3 temp; 
    
        if(idx < NP)
        {
            // initialise temp and answer[idx] to 0.0
    
            for(j=0; j<NV; j++)
            {
                // contribution of the current vortex element 
                temp = someInlineFcn(input1[j], input2[idx]);
    
                // accumulate contributions from each vortex element
                answer[idx] = temp + answer[idx];
            }
        }
    }
    
    
    int main()
    {
    
        // allocate arrays of vector3 type and do checks
        // read in data
        // Get number of output points and use that to set blockcount and threadsperblock
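        // e.g. (hypothetical values): threadsperblock = 256;
        //      blockcount = (NP + threadsperblock - 1)/threadsperblock;  // ceiling division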
    
        dim3 dimGrid(blockcount);
        dim3 dimBlock(threadsperblock);
    
        biot_predynamic_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);
    
        // Block until the device has completed (cudaThreadSynchronize is deprecated)
        cudaDeviceSynchronize();
    
        // Check for errors and write results
    }
    
  2. Dynamic CUDA Implementation - having difficulty figuring out indexing

    __global__ void biot_dynamic_child_kernel(vector3 *input1, vector3 point, vector3 *result)
    {
        // Get current thread's index within this child grid...???
        int vortexIdx = ??  // index of the current vortex in the input1 array

        // Temporary storage for this vortex's contribution
        vector3 temp;

        if(vortexIdx < NV)
        {
            // contribution of the current vortex element to the current point
            temp = someInlineFcn(input1[vortexIdx], point);

            // accumulate atomically so concurrent child threads don't overwrite each other;
            // atomicAdd has no vector3 overload, so add component by component
            atomicAdd(&result->x, temp.x);
            atomicAdd(&result->y, temp.y);
            atomicAdd(&result->z, temp.z);
        }
    }
    
    __global__ void biot_dynamic_parent_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
    {
        // Get current thread index
        int pointIdx = ???;
    
        if(pointIdx < NP)
        {
            // Initialise answer[pointIdx] to 0.0
    
            // parallelise the inner loop over all vortices; the atomic adds inside the
            // child kernel stop concurrent child threads overwriting answer[pointIdx]
            biot_dynamic_child_kernel<<<???,???>>>(input1, input2[pointIdx], &answer[pointIdx]);
    
            // do I need to sync here? If so, how do I sync only the children spawned
            // by this parent, as opposed to the children of all parents?
        }
    }
    
    
    int main()
    {
    
        // allocate arrays of vector3 type and do checks
        // read in data
        // Get number of output points and use that to set blockcount and threadsperblock ???
    
        dim3 dimGrid(blockcount); //????
        dim3 dimBlock(threadsperblock); //????
    
        biot_dynamic_parent_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);
    
        // Block until the device has completed (cudaThreadSynchronize is deprecated)
        cudaDeviceSynchronize();
    
        // Check for errors and write results
    }
    
CUDA dynamic parallelism is not first and foremost about performance. It's a feature designed to make it easier to realize programs in a programmer-friendly way, increase code re-use opportunities and programmer productivity, and perhaps improve code readability. But a simple re-factoring of an already solved problem using CDP is unlikely to yield significant performance benefit if the machine is already fully utilized in the first realization. Instead, as @JackOLantern is suggesting, you may want to revisit your basic algorithm. Furthermore, analysis-driven optimization may help. – Robert Crovella
@RobertCrovella CUDA dynamic parallelism does improve the performance of algorithms which do not fit a flat, single-level parallelism, such as interpolation. See this GTC2013 talk. – Vitality

1 Answer


Your problem seems to fall into the category of N-body problems, in which you have to calculate the effect of, say, N sources on N observation points. From your code snippet 1, it is apparent that you are using a "brute-force" approach, whose cost scales as O(N^2). Before trying dynamic parallelism, it would be useful to study tree-based approaches to the N-body problem, which reduce the computational complexity to O(N log N). Some pointers can be found at:

K-means clustering for optimal domain decomposition of parallel hierarchical N-body simulations

and at

Vortex filament method as a tool for computational visualization of quantum turbulence

The basic idea is to exploit the 1/r^2 decay of the effect of each vortex, calculating near-field interactions "exactly" and far-field interactions approximately. This is the same idea underlying the Multi-Level Fast Multipole Method (MLFMM) of electrodynamics. The CUDA toolkit contains an N-body sample, which is well explained in "The CUDA Handbook" by Nicholas Wilt. You may also wish, for your own knowledge, to have a look at the MLFMM.
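
That N-body sample is itself a tiled brute-force implementation, and the tiling idea applies directly to your snippet 1: instead of every thread streaming all NV vortices from global memory, each block stages one tile of vortices in shared memory at a time, and the per-thread sum is accumulated in a register rather than read-modify-written to answer[idx] on every iteration. A minimal sketch of that idea, reusing your (hypothetical) vector3 type and someInlineFcn, and assuming vector3 is a plain struct of three floats with an operator+ defined:

    #define TILE 256  // threads per block; also the number of vortices staged per step

    __global__ void biot_tiled_kernel(const vector3 *vortices, const vector3 *points,
                                      vector3 *answer, int NV, int NP)
    {
        __shared__ vector3 tile[TILE];

        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        vector3 acc = {0.0f, 0.0f, 0.0f};  // accumulate in a register, write once at the end
        vector3 p;
        if(idx < NP) p = points[idx];

        // sweep the vortex array one tile at a time (launch with blockDim.x == TILE)
        for(int base = 0; base < NV; base += TILE)
        {
            // each thread stages one vortex into shared memory
            if(base + threadIdx.x < NV) tile[threadIdx.x] = vortices[base + threadIdx.x];
            __syncthreads();

            // every thread in the block now reads the tile from fast shared memory
            int limit = min(TILE, NV - base);
            if(idx < NP)
                for(int t = 0; t < limit; t++)
                    acc = acc + someInlineFcn(tile[t], p);
            __syncthreads();  // all threads reach both barriers, so they are safe
        }

        if(idx < NP) answer[idx] = acc;
    }

Even with no algorithmic change, this kind of tiling is usually the first thing to try on an all-pairs kernel.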

As a final remark, let me mention that CUDA dynamic parallelism is mainly useful for improving performance in two cases:

  1. when the algorithm does not fit a flat, single-level parallelism; for example, interpolation fits a two-level parallelism (see this GTC2013 Conference Talk, and the sketch after this list);
  2. when the discretization of the computational grid is nonuniform over the computational domain and changes in time; the classical example is the turbulent flow reported in the CUDA dynamic parallelism whitepaper.
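
To make the two-level structure concrete (and to touch on your question 2): a child grid gets its own blockIdx and threadIdx, numbered from zero exactly as in a host-side launch; nothing is inherited from the parent, so the parent must pass its own identity (here, the point it owns) explicitly as an argument. A minimal sketch under your assumptions (vector3 as a plain struct of three floats, someInlineFcn as in the question; dynamic parallelism needs a compute capability 3.5+ device and nvcc -rdc=true):

    // child grid: its threads index the vortices, starting from zero for every launch
    __global__ void child_kernel(const vector3 *vortices, vector3 point,
                                 vector3 *result, int NV)
    {
        int vortexIdx = blockIdx.x*blockDim.x + threadIdx.x;  // local to THIS child grid
        if(vortexIdx < NV)
        {
            vector3 u = someInlineFcn(vortices[vortexIdx], point);
            atomicAdd(&result->x, u.x);  // no vector3 overload: add per component
            atomicAdd(&result->y, u.y);
            atomicAdd(&result->z, u.z);
        }
    }

    // parent grid: one thread per observation point
    __global__ void parent_kernel(const vector3 *vortices, const vector3 *points,
                                  vector3 *answer, int NV, int NP)
    {
        int pointIdx = blockIdx.x*blockDim.x + threadIdx.x;
        if(pointIdx < NP)
        {
            int threads = 256;
            int blocks  = (NV + threads - 1)/threads;
            child_kernel<<<blocks, threads>>>(vortices, points[pointIdx],
                                              &answer[pointIdx], NV);
            // no explicit sync is needed just for correctness: the parent grid is
            // not considered complete until all of its child grids have completed
        }
    }

Note, however, that with NP ~ 10^7 parent threads this creates millions of child grids, whose launch overhead will most likely swamp any benefit; this is exactly why the flat kernel (or its tiled variant above) tends to remain the better brute-force baseline.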

Good discussions of the performance improvements achievable with CUDA dynamic parallelism can be found in the above-quoted CUDA Handbook and in the second edition of Programming Massively Parallel Processors: A Hands-On Approach by David B. Kirk and Wen-mei W. Hwu.