1 vote

I have some .cpp files which implement Smoothed Particle Hydrodynamics (SPH), a particle method for modelling fluid flow.

One of the most time-consuming components of these particle techniques is finding the nearest neighbours (k-nearest neighbours or range searching) for every particle at every time-step of the simulation.

Right now I just want to accelerate the neighbour search routine using GPUs and CUDA, replacing my current CPU-based neighbour search routine. Only the neighbour search will run on the GPU while the rest of the simulation proceeds on the CPU.

My question is: how should I go about compiling the entire code? To be more specific, suppose I write the neighbour-search kernel function in a file nsearch.cu.

Should I then rename all my previous .cpp files as .cu files and recompile the whole set (along with nsearch.cu) using nvcc? At least for simple examples, nvcc does not compile CUDA code in files with a .cpp extension: nvcc foo.cu compiles, but the same code in hello.cpp doesn't.

In short, what should be the structure of this CUDA plugin and how should I go about compiling it?

I am using Ubuntu Linux 10.10, CUDA 4.0, an NVIDIA GTX 570 (compute capability 2.0), and the gcc compiler for my work.


2 Answers

2 votes

You need to write the nsearch.cu file and compile it with "nvcc -c nsearch.cu -o nsearch.o", then link nsearch.o into the main application. There also has to be an nsearch.h file that exports a wrapper around the actual kernel:

in nsearch.h:

    void kern();   // host-side wrapper, callable from the .cpp files

in nsearch.cu:

    #include "nsearch.h"

    __global__ void kern__() {
        // ... neighbour search kernel body ...
    }

    void kern() {
        kern__<<<...>>>();   // fill in your grid/block configuration
    }
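With this split, your existing .cpp files never see any CUDA syntax; they just include nsearch.h and call kern(). The build then looks something like this (the file names here are placeholders for your project, and the library path assumes a standard 64-bit CUDA install):

    nvcc -c nsearch.cu -o nsearch.o
    g++ -c main.cpp -o main.o            # ...and the rest of your .cpp files
    g++ main.o nsearch.o -o sph -L/usr/local/cuda/lib64 -lcudart

Linking against libcudart resolves the CUDA runtime calls that the kernel launch in nsearch.cu compiles down to.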
0 votes

This is a broader response to your question, since I have been through a very similar thought process to yours: moving my hydrodynamics code onto the GPU whilst leaving everything else on the CPU. Although I think that is where you should start, I also think you should start planning to move all of your other code onto the GPU as well. What I found is that whilst the GPU was very good at doing the matrix decomposition required for my simulation, the memory boundary between GPU and CPU memory was so slow that something like 80-90% of the GPU simulation time was being spent in cudaMemcpyDeviceToHost/cudaMemcpyHostToDevice.
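Before committing either way, it is worth measuring how much of your own GPU time goes into the copies. A minimal sketch using CUDA events (the helper time_copy and the buffer names h_pos/d_pos are hypothetical, not part of your code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Times a single host->device copy of a particle position buffer.
    void time_copy(const float* h_pos, float* d_pos, size_t bytes) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_pos, h_pos, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("host->device copy: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

If the copy time dominates the kernel time, the usual first fix is to keep the particle arrays resident on the device between time-steps rather than copying them back and forth every step.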