Shared memory is usually more efficient than message passing, as the latter usually requires increased data movement (moving data from the source to its destination) which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every generation.
The material states that MPI-only applications are usually on-par or better than hybrid applications, although they usually have larger memory requirements.
However, they are based on the fact that most of the large hybrid applications shown were based on parallel computation then serial communication.
This kind of implementations are usually susceptible to the following problems:
Non uniform memory access: having two sockets in a single node is a popular setup in HPC. Since modern processors have their memory controller on chip, half of the memory will be easily accessible from the local memory controller, meanwhile the other half has to pass through the remote memory controller (i.e., the one present in the other socket). Therefore, how the program allocates memory is very important: if the memory is reserved in the serialized phase (on the closest possible memory), then half of the cores will suffer longer main memory accesses.
Load balance: each *parallel computation to serialized communication** phase implies a synchronization barrier. This barriers force the fastest cores to wait for the slowest cores in a parallel region. Fastest/slowest unbalance may be affected by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
Some of this issues are more straightforward to solve than others. For example,
the multiple-socket NUMA problem can be mitigated placing different MPI processes in different sockets inside the same node.
To really exploit the efficiency of shared memory parallelism, the best option is trying to overlap communication with computation and ensure load balance between all processes, so that the synchronization cost is mitigated.
However, developing hybrid applications which are both load balanced and do not impose big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.