Well if you make number of MPI tasks higher than number of cores it makes no sense, because you start to enforce 2 tasks on one processing unit, and therefore exhaustion of computing resources.
When it comes to preferred substantially lower number of tasks over cores on Xeon Phi. Maybe they prefer threads over processes. The architecture of Xeon Phi is quite peculiar and overhead introduced by maintaining an MPI task can seriously cripple computing performance. I will not hide that I do not know technical reason behind it. But maybe someone will fill it in.
If I recall correctly communication bus in there is a ring (or two rings), so maybe all to all communication and barriers are polluting bus and turns out to be ineffective.
Using threads or the native execution mode they provide has less overhead.
Also I think you should look at it more like a multicore CPU, not a multi-CPU machine. For greater performance you don't want to run 4 MPI tasks on a 4-core CPU either, you want to run one 4-threaded MPI task.