I'm solving the Biot Savart law using CUDA, determining the velocity at NP
points induced by NV
line element vortices (where NP ~ 2NV ~ 10^7
).
Due to the nature of the problem, each vortex affects each point. So it makes sense to assign each point to a thread and have that thread compute the influence of all NV
vortices on the point.
Despite apparently good occupancy (NP>>Nprocessors
), execution of a scenario in which CUDA is used to parallelise the problem in this way is pretty slow (i.e. similar times to just running it on the CPU). My suspicion is that this is because the kernel is fairly complex, as each kernel invocation includes a for loop running through ~10^7
calculations.
I've given some consideration to dynamic parallelism, where a parent thread for each point can spawn many threads (one for each vortex) to replace the inner for loop. In other words, the child kernel is simply elemental and computes the interaction between a single vortex and a single point.
I'm no computer scientist, so was able to implement a basic CUDA approach (1. below), but I am really struggling with the indexing required to even try out a dynamic approach for benchmarking. Please forgive naivety here!
So I have two questions:
Am I barking up the right tree? Do others feel this dynamic approach would be more efficient than just parallelising the outer loop (as in pseudocode snippet 1 below)?
What is the best/correct indexing strategy? I just can't figure out how
threadIdx
behaves in the children. When I launch a kernel from a parent thread, do the children of that thread have their ownthreadIdx
, or is it dependent on the parentthreadIdx
in some way (in which case how do I figure out the child thread index)?
Examples in pseudocode here:
Pre-dynamic CUDA Implementation
__global__ void biot_predynamic_kernel(vector3 *input1, vector3 *input2, vector3 *answer) { // Get current thread's index int idx = blockIdx.x*blockDim.x + threadIdx.x; // Sizes and other inputs omitted for clarity // Counter up to the size of the input array NV int j // Temporary storage for accumulating result vector3 temp; if(idx < NP) { // initialise tempUind and answer[idx] to 0.0 for(j=0; j<NV; j++) { // contribution of the current vortex element temp = someInlineFcn(input1[j], input2[idx]); // accumulate contributions from each vortex element answer[idx] = temp + answer[idx]; } } } void main { // allocate arrays of vector3 type and do checks // read in data // Get number of output points and use that to set blockcount and threadsperblock dim3 dimGrid(blockcount); dim3 dimBlock(threadsperblock); biot_predynamic_kernel<<<dimGrid,dimBlock>>>(answer, input); // Block execution until device has completed cudaThreadSynchronize(); // Check for errors and write results }
Dynamic CUDA Implementation - having difficulty figuring out indexing
__global__ void biot_dynamic_child_kernel(vector3 *input1, vector3 *input2, vector3 result) { // Get current thread indices...????? int vortexIdx = ?? // index to the current vortex in the input1 vector int pointIdx = ?? // index to the current point in the input2 vector // Temporary storage for results vector3 temp; if(vortexIdx < NV) { // contribution of the current vortex element to the current point vector3 temp = someInlineFcn(input1[vortexIdx], input2[pointIdx]); } result = AtomicAdd(result,temp); } __global__ void biot_dynamic_parent_kernel(vector3 *input1, vector3 *input2, vector3 *answer) { // Get current thread index int pointIdx = ???; if(pointIdx < NP) { // Initialise tempUind and answer[idx] to 0.0 // parallelise the inner loop over all filaments - atomic addition inside the kernel prevents overwrite of answer[pointIdx] by concurrent threads biot_dynamic_child_kernel<<<???,???>>>(input1, input2[pointIdx], answer[pointIdx]); // do I need to sync threads here? If so how to sync threads spawned by just this parent as opposed to all parents? } } void main { // allocate arrays of vector3 type and do checks // read in data // Get number of output points and use that to set blockcount and threadsperblock ??? dim3 dimGrid(blockcount); //???? dim3 dimBlock(threadsperblock); //???? biot_dynamic_parent_kernel<<<dimGrid,dimBlock>>>(answer, input); // Block execution until device has completed cudaThreadSynchronize(); // Check for errors and write results }
N
-body problems in which you have to calculate the effect of, say,N
sources toN
observation points. From your code snippet #1, it is apparent that you are using a "brute-force" approach, which sequentially scales asO(N^2)
. Before trying to use dynamic parallelism, it would be useful if you study tree-based approaches to theN
-body problem, which sequentially reduce the computational complexity toO(NlogN)
. – Vitality1/r^2
decay of the effect of each vortex and calculating in an "exact" way near-field interactions and in an approximate way far-field interactions. This is the same idea underlying the Muli-Level Fast Multipole Method (MLFMM) of electrodynamics. The CUDA toolkit contains anN
-body sample, which is well explained in the "CUDA Handbook" by Nicolas Wilt. You may also wish, for your own knowledge, to have a look at the MLFMM. – Vitality