3 votes

I have an OpenCL kernel that calculates the total force exerted on each particle by the other particles in the system, and another kernel that integrates the particle positions/velocities. I would like to parallelize these kernels across multiple GPUs, basically assigning a subset of the particles to each GPU. However, I have to run these kernels many times, and the result from each GPU is used on every other GPU. Let me explain that a little further:

Say you have particle 0 on GPU 0 and particle 1 on GPU 1. The force on particle 0 is updated, as is the force on particle 1, and then the integrator updates their positions and velocities accordingly. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are), and these new positions are used to calculate the forces on each particle in the next step, which are used by the integrator, whose results are used to calculate forces, and so on. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.

So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results every step will cause more slowdown than parallelizing the algorithm across GPUs is worth.
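For concreteness, the per-step exchange I'm worried about would look roughly like this with a separate position buffer on each of two GPUs (names are illustrative, float4 positions assumed, error checking omitted):

    #include <CL/cl.h>

    /* Roughly the per-step exchange: each GPU has its own full-size position
       buffer, owns half the particles, and the halves are merged on the host
       and re-broadcast every step.
       (Illustrative names, float4 positions assumed, error checking omitted.) */
    static void exchange_positions(cl_command_queue queue[2], cl_mem pos[2],
                                   cl_float4 *host_pos, size_t half, size_t n)
    {
        /* read back only the half of the particles each GPU owns */
        for (int d = 0; d < 2; ++d)
            clEnqueueReadBuffer(queue[d], pos[d], CL_FALSE,
                                d * half * sizeof(cl_float4), /* offset of this GPU's half */
                                half * sizeof(cl_float4),     /* size of this GPU's half   */
                                host_pos + d * half, 0, NULL, NULL);

        clFinish(queue[0]);
        clFinish(queue[1]);

        /* push the full, merged position array back to both GPUs */
        for (int d = 0; d < 2; ++d)
            clEnqueueWriteBuffer(queue[d], pos[d], CL_FALSE, 0,
                                 n * sizeof(cl_float4), host_pos, 0, NULL, NULL);
    }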

I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know specifically about Nvidia GPUs (more specifically, the Tesla M2090).

EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer in a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs: when I run watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while and then go down while another GPU's memory usage goes up. Is there anyone out there who can point me in the right direction with this? Maybe it's just their implementation?


1 Answer

4 votes

It sounds like you are having implementation trouble.

There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.

I imagine that, in your current setup, you have a single context containing multiple devices, with multiple command queues. This is probably the right way to go for what you're doing.
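For reference, a minimal sketch of that kind of setup, assuming two GPUs on a single platform (error checking omitted):

    #include <CL/cl.h>

    /* Sketch: one context spanning two GPUs, one command queue per device
       (error checking omitted). */
    int main(void)
    {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        cl_device_id devices[2];
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, NULL);

        /* one context that contains both devices */
        cl_context context = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);

        /* a separate command queue for each device, sharing that context */
        cl_command_queue queue[2];
        for (int d = 0; d < 2; ++d)
            queue[d] = clCreateCommandQueue(context, devices[d], 0, NULL);

        /* ... build programs, create buffers and kernels from the same context ... */
        return 0;
    }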

Appendix A of the OpenCL 1.2 specification says that:

OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.

Further:

The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.

So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
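On a single in-order queue the enqueue order already gives you that; across two queues you can express the dependency with events. A rough sketch, assuming each device works on its own offset/count slice (illustrative names, error checking omitted):

    #include <CL/cl.h>

    /* Sketch: the integrator on each queue waits, via events, for the force
       kernels on BOTH queues, so neither device integrates with stale forces.
       (Illustrative names; error checking omitted.) */
    static void enqueue_step(cl_command_queue queue[2], cl_kernel force_kernel,
                             cl_kernel integrate_kernel,
                             const size_t offset[2], const size_t count[2])
    {
        cl_event force_done[2];

        for (int d = 0; d < 2; ++d)
            clEnqueueNDRangeKernel(queue[d], force_kernel, 1, &offset[d], &count[d],
                                   NULL, 0, NULL, &force_done[d]);

        for (int d = 0; d < 2; ++d)
            clEnqueueNDRangeKernel(queue[d], integrate_kernel, 1, &offset[d], &count[d],
                                   NULL, 2, force_done, NULL);

        clReleaseEvent(force_done[0]);
        clReleaseEvent(force_done[1]);
    }

The same trick, with the next force launch waiting on both integrator events, keeps every step ordered across the two queues.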

To put things more in terms of your question:

What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?

... I think the answer is "don't have the buffers be separate." Use the same cl_mem object on both devices by creating that cl_mem object from the same context.

As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data lives, and just access it from both command queues.
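As a rough sketch of what that looks like, assuming float4 positions, an even split of n particles, and the context/queues from the setup above (illustrative only, error checking omitted):

    #include <CL/cl.h>

    /* Sketch: a single position buffer created from the shared context and
       launched on each device's queue over that device's half of the particles.
       (Assumes float4 positions and an even split; error checking omitted.) */
    static void launch_forces(cl_context context, cl_command_queue queue[2],
                              cl_kernel force_kernel, size_t n)
    {
        cl_mem positions = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                          n * sizeof(cl_float4), NULL, NULL);

        /* both devices see the same cl_mem object */
        clSetKernelArg(force_kernel, 0, sizeof(cl_mem), &positions);

        size_t half = n / 2;
        for (int d = 0; d < 2; ++d) {
            size_t offset = d * half;   /* this device's slice of the particles */
            clEnqueueNDRangeKernel(queue[d], force_kernel, 1, &offset, &half,
                                   NULL, 0, NULL, NULL);
        }
    }

Whether the runtime then migrates the buffer between devices or keeps copies on both is up to the implementation, which is likely what you're seeing in nvidia-smi.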

I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.

Another thing you could try in order to get a better (or at least different) buffer-sharing behavior would be to make the particle data a map.
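For example, you could keep the buffer host-accessible and map it whenever the host needs to touch the data, instead of issuing explicit reads and writes. A rough sketch (illustrative names, error checking omitted):

    #include <CL/cl.h>

    /* Sketch: allocate the particle buffer as host-accessible, then map it when
       the host needs to read or modify the data, and unmap it before the kernels
       run again. (Illustrative names; error checking omitted.) */
    static cl_mem create_positions(cl_context context, size_t n)
    {
        return clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              n * sizeof(cl_float4), NULL, NULL);
    }

    static void touch_positions(cl_command_queue queue, cl_mem positions, size_t n)
    {
        /* blocking map: when this returns, host_view points at the current data */
        cl_float4 *host_view = (cl_float4 *)clEnqueueMapBuffer(
            queue, positions, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
            0, n * sizeof(cl_float4), 0, NULL, NULL, NULL);

        /* ... read or modify host_view here ... */

        /* unmap before enqueueing kernels that use the buffer again */
        clEnqueueUnmapMemObject(queue, positions, host_view, 0, NULL, NULL);
    }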

If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times it exists in both places.

All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.

I hope I was helpful,

Ryan