1 vote

As far as I know, in a multiprocessor environment any thread/process can be allocated to any core/processor, so what is meant by the following line:

the number of MPI ranks used on an Intel Xeon Phi coprocessor should be substantially fewer than the number of cores in no small part because of limited memory on the coprocessor.

I mean, what are the issues if #cores <= #MPI ranks?


2 Answers

1 vote

That quote is correct only when it is applied to a memory-size-constrained problem; as a general statement it would be incorrect. In general you should use more tasks than you have physical cores on the Xeon Phi in order to hide memory latency [1].

To answer your question "What are the issues if the number of cores is fewer than the number of MPI ranks?": you run the risk of too much context switching. Still, on many problems it is advantageous to use more tasks than you have cores in order to hide memory latency [2].

1. I don't even feel like I need to cite a reference for this because of how loudly it is advertised; however, it is mentioned in an article, the OpenCL design and programming guide for the coprocessor: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor

2. This advice applies to the Xeon Phi specifically, not necessarily to other hardware.
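A quick way to see oversubscription in practice is to have each rank report which logical CPU it is running on. The sketch below is my own illustration, not something from the answer; it assumes a Linux system (as on the Xeon Phi), since it uses sched_getcpu(). If you launch more ranks than there are hardware threads, several ranks will report the same CPU.

    /* Minimal sketch: each rank prints its rank, hostname, and the logical CPU
     * it last ran on, so shared cores (oversubscription) become visible.
     * Assumes Linux; build with something like: mpicc whereami.c -o whereami */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char host[256];
        gethostname(host, sizeof(host));

        /* sched_getcpu() returns the logical CPU this process last ran on. */
        printf("rank %d of %d on %s, cpu %d\n", rank, size, host, sched_getcpu());

        MPI_Finalize();
        return 0;
    }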

0 votes

Well, if you make the number of MPI tasks higher than the number of cores it makes no sense, because you start forcing two tasks onto one processing unit and therefore exhaust the computing resources.

As for why a substantially lower number of tasks than cores is preferred on the Xeon Phi: maybe they prefer threads over processes. The architecture of the Xeon Phi is quite peculiar, and the overhead introduced by maintaining an MPI task can seriously cripple computing performance. I will not hide that I do not know the technical reason behind it, but maybe someone will fill it in.

If I recall correctly, the communication bus in there is a ring (or two rings), so maybe all-to-all communication and barriers pollute the bus and turn out to be ineffective.

Using threads or the native execution mode they provide has less overhead.

Also, I think you should look at it more like a multicore CPU than a multi-CPU machine. For greater performance you don't want to run 4 MPI tasks on a 4-core CPU either; you want to run one 4-threaded MPI task.
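To make that last point concrete, here is a minimal hybrid sketch (my own illustration, not something the answer provides): one MPI rank that fills the cores with OpenMP threads instead of launching one rank per core. It uses only standard MPI and OpenMP calls; build with something like mpicc -fopenmp, run one rank per CPU or coprocessor, and set OMP_NUM_THREADS to the number of cores or hardware threads.

    /* Minimal hybrid MPI+OpenMP sketch: one rank, many threads.
     * Build: mpicc -fopenmp hybrid.c -o hybrid
     * Run:   one rank per node/coprocessor, OMP_NUM_THREADS fills the cores. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Each thread reports itself; the critical section keeps output tidy. */
            #pragma omp critical
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }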