2 votes

I use a cluster with several nodes. Each of them has 2 processors with 8 cores each. I use Open MPI with SLURM.

My tests show the following MPI Send/Recv transfer rates: between the MPI processes with ranks 0 and 1 it is about 9 GB/s, but between processes 0 and 2 it is only 5 GB/s. I assume this happens because those processes execute on different processors.

I'd like to avoid non-local memory access. The recommendations I found here did not help. So the question is: is it possible to run 8 MPI processes, all on THE SAME processor? If so, how do I do it?

Thanks.


4 Answers

2 votes

The following set of command-line options to mpiexec should do the trick with versions of Open MPI before 1.7:

--bycore --bind-to-core --report-bindings

The last option will pretty-print the actual binding for each rank. Binding also activates some NUMA-awareness in the shared-memory BTL module.

Starting with Open MPI 1.7, processes are distributed round-robin over the available sockets and bound to a single core by default. To replicate the above command line, one should use:

--map-by core --bind-to core --report-bindings
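For an 8-rank job, the two variants could look like this (the executable name `./my_app` is a placeholder):

```shell
# Open MPI before 1.7: place ranks core by core, bind each rank
# to a single core, and print the resulting bindings
mpiexec -n 8 --bycore --bind-to-core --report-bindings ./my_app

# Open MPI 1.7 and later: equivalent mapping and binding
mpiexec -n 8 --map-by core --bind-to core --report-bindings ./my_app
```

The `--report-bindings` output on stderr lets you confirm that all 8 ranks ended up on cores of the same socket.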
0 votes

It appears to be possible. The Process Binding and Rankfiles sections of the Open MPI mpirun man page look promising. I would try some of the options shown with `--report-bindings` set, so you can verify that process placement is what you intend, and see whether you get the performance improvement you expect out of your code.
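As a concrete illustration, a rankfile pinning 8 ranks to the 8 cores of socket 0 on one node could look like this (the hostname `node01` is a placeholder; the `slot=socket:core` syntax follows the rankfile format described in the mpirun man page):

```
rank 0=node01 slot=0:0
rank 1=node01 slot=0:1
rank 2=node01 slot=0:2
rank 3=node01 slot=0:3
rank 4=node01 slot=0:4
rank 5=node01 slot=0:5
rank 6=node01 slot=0:6
rank 7=node01 slot=0:7
```

It would then be passed to mpirun, e.g. `mpirun -n 8 --rankfile myrankfile --report-bindings ./my_app`.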

0 votes

You should look at the hostfile/rankfile documentation for your MPI library. Open MPI and MPICH use different formats, but both will give you what you want.

Keep in mind that you will run into performance issues if you oversubscribe the processor too heavily. Running more than 8 ranks on an 8-core processor will cost you the performance benefits you gain from locally shared memory.
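For reference, the hostfile formats of the two libraries differ; a sketch for a single node with 8 slots (the hostname `node01` is a placeholder):

```
# Open MPI hostfile: one line per host, slot count as a keyword
node01 slots=8

# MPICH (Hydra) hostfile: slot count after a colon
node01:8
```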

0 votes

With Slurm, set:

#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=8

to have all cores allocated on the same socket (CPU die), provided Slurm is correctly configured.
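Putting it together, a minimal batch script might look like this (the job name and the executable `./my_app` are placeholders; it assumes an Open MPI build that understands Slurm, so `srun` can launch the ranks directly):

```shell
#!/bin/bash
#SBATCH --job-name=single-socket-test
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=8

# All 8 tasks are constrained to one socket by the directives above
srun ./my_app
```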