3 votes

I am developing a program that runs on a 4-node cluster with 4 cores per node. I have a fairly fast OpenMP version of the program that runs on a single node, and I am trying to scale it with MPI. Due to my limited experience I am wondering which would give me better performance: a hybrid MPI+OpenMP architecture or an MPI-only architecture? I have seen this slide claiming that the hybrid approach generally cannot outperform the pure MPI one, but it gives no supporting evidence and seems counter-intuitive to me.

BTW, my platform uses InfiniBand to interconnect the nodes.

Thanks a lot, Bob

This can depend entirely on your MPI/OpenMP implementation, as well as on the design of your algorithm (e.g. the number and size of MPI messages). Why not profile both approaches? – suszterpatt

@suszterpatt I am aware of that; I am asking in a general sense whether there is any theoretical reasoning that makes one of the two approaches definitely better than the other. It's like asking whether quicksort is better than bubble sort: yes, it depends on the implementation and on what you really want, but we all know quicksort is in theory faster. – Boyu Fang

There is no well-developed theory to support an argument from first principles that a hybrid program will be faster (or slower) than a pure one. The only answer(s) you will get will come from experimentation. – High Performance Mark

I can confirm the slide based on my own experience. – Chiel

1 Answer

1 vote

Shared memory is usually more efficient than message passing, as the latter requires additional data movement (moving data from the source to its destination), which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every hardware generation.

The material states that MPI-only applications are usually on par with or better than hybrid applications, although they usually have larger memory requirements.

However, those results are based on the fact that most of the large hybrid applications shown followed a pattern of parallel computation followed by serialized communication.

Implementations of this kind are usually susceptible to the following problems:

  1. Non-uniform memory access: having two sockets in a single node is a common setup in HPC. Since modern processors have their memory controller on chip, half of the memory is easily reachable through the local memory controller, while the other half has to go through the remote memory controller (i.e., the one in the other socket). How the program allocates memory is therefore very important: if all memory is allocated during the serialized phase (and thus placed as close as possible to the allocating core), half of the cores will suffer longer main-memory accesses (see the first-touch sketch after this list).

  2. Load balance: each *parallel computation then serialized communication* phase implies a synchronization barrier. These barriers force the fastest cores to wait for the slowest ones in a parallel region. The fast/slow imbalance may be caused by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
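
As a rough illustration of the NUMA point above, here is a minimal C/OpenMP sketch (array size and kernels are made up) of first-touch allocation: touching the pages in a parallel loop that uses the same schedule as the compute loop keeps most accesses on the local socket, whereas a serial initialization would place all pages near a single socket.

```c
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* illustrative array size */

int main(void)
{
    double *a = malloc(N * sizeof *a);

    /* Serial initialization (all pages "first touched" by one thread)
     * would place the whole array on one socket:
     *   for (long i = 0; i < N; i++) a[i] = 0.0;
     */

    /* Parallel first touch: each thread touches the pages it will use
     * later, so (with the default first-touch policy on Linux) they
     * are allocated on that thread's local socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute phase with the same static schedule, so each thread
     * mostly reads and writes memory on its own socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```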

Some of these issues are more straightforward to solve than others. For example, the multiple-socket NUMA problem can be mitigated by placing different MPI processes in different sockets inside the same node.
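
As a hedged sketch of that setup, the C skeleton below binds each rank's OpenMP threads to whatever resources the launcher gave that rank; the launch line in the comment is Open MPI style and the exact flags will differ with other MPI implementations.

```c
/* Hybrid MPI+OpenMP skeleton: one rank per socket, several threads per rank.
 * With Open MPI the launch looks roughly like
 *   mpirun -np <ranks> -x OMP_NUM_THREADS=<threads> \
 *          --map-by socket --bind-to socket ./hybrid
 * (flags are implementation-specific). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for FUNNELED support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library lacks the requested thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Each rank's threads share that rank's memory, which stays
         * local to the socket the rank is bound to. */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```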

To really exploit the efficiency of shared-memory parallelism, the best option is to try to overlap communication with computation and to ensure load balance between all processes, so that the synchronization cost is mitigated.
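
A sketch of the overlap idea in C (a made-up 1-D halo exchange; compute_interior/compute_boundary stand in for the real kernels): post the non-blocking exchanges first, work on the interior while the messages are in flight, and wait only where the halo values are actually needed.

```c
#include <mpi.h>

/* Placeholder kernels; in a real code these would be the expensive
 * (possibly OpenMP-parallel) loops. */
static void compute_interior(double *field, int n)
{
    for (int i = 1; i < n - 1; i++)
        field[i] *= 0.5;
}

static void compute_boundary(double *field, double halo_lo, double halo_hi, int n)
{
    field[0]     = 0.5 * (field[0]     + halo_lo);
    field[n - 1] = 0.5 * (field[n - 1] + halo_hi);
}

/* One step of the exchange, overlapping communication with computation.
 * 'left'/'right' are neighbour ranks (MPI_PROC_NULL at domain edges). */
static void step(double *field, int n, int left, int right, MPI_Comm comm)
{
    double halo_lo = 0.0, halo_hi = 0.0;
    MPI_Request reqs[4];

    /* 1. Post the exchanges without blocking. */
    MPI_Irecv(&halo_lo, 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&halo_hi, 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&field[0],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&field[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* 2. Work on data that does not depend on the halos while the
     *    messages are in flight. */
    compute_interior(field, n);

    /* 3. Block only when the boundary values are really needed. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(field, halo_lo, halo_hi, n);
}

int main(int argc, char **argv)
{
    int rank, size;
    double field[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbours in a simple 1-D decomposition (no wrap-around). */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    step(field, 8, left, right, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```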

However, developing hybrid applications that are both load balanced and free of big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.