I'm struggling to understand how to set up a distributed MPI cluster with ipython/ipyparallel. I don't have a strong MPI background.
I've followed the instructions in the ipyparallel docs (Using ipcluster in mpiexec/mpirun mode), and this works fine for distributing computation on a single node. So I create an mpi profile and configure it as per those instructions:
$ ipython profile create --parallel --profile=mpi
$ vim ~/.ipython/profile_mpi/ipcluster_config.py
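For reference, the only change I make in ipcluster_config.py is to select the MPI engine launcher, roughly like this (a sketch from memory; the exact launcher name may differ between ipyparallel versions):
# ~/.ipython/profile_mpi/ipcluster_config.py
# Launch the engines via mpiexec/mpirun instead of plain subprocesses
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'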
Then on host A I start a controller and 4 MPI engines:
$ ipcontroller --ip='*' --profile=mpi
$ ipcluster engines --n=4 --profile=mpi
Running the following snippet:
from ipyparallel import Client
from mpi4py import MPI

c = Client(profile='mpi')
view = c[:]

print("Client MPI.COMM_WORLD.Get_size()=%s" % MPI.COMM_WORLD.Get_size())
print("Client engine ids %s" % c.ids)

def _get_rank():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank()

def _get_size():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_size()

print("Remote COMM_WORLD ranks %s" % view.apply_sync(_get_rank))
print("Remote COMM_WORLD size %s" % view.apply_sync(_get_size))
yields
Client MPI.COMM_WORLD.Get_size()=1
Client engine ids [0, 1, 2, 3]
Remote COMM_WORLD ranks [1, 0, 2, 3]
Remote COMM_WORLD size [4, 4, 4, 4]
Then on host B I start 4 more MPI engines and run the snippet again, which yields
Client MPI.COMM_WORLD.Get_size()=1
Client engine ids [0, 1, 2, 3, 4, 5, 6, 7]
Remote COMM_WORLD ranks [1, 0, 2, 3, 2, 3, 0, 1]
Remote COMM_WORLD size [4, 4, 4, 4, 4, 4, 4, 4]
It seems that the engines from each ipcluster command are grouped into separate communicators of size 4, hence the duplicate ranks. And there is only one MPI process for the client.
Questions:
- ipython/ipyparallel doesn't seem to be making the MPI connections between hosts. Should ipyparallel be handling the MPI setup, or should I, as a user, be creating the MPI setup myself, as "IPython MPI with a Machinefile" suggests? My assumption was that ipyparallel would handle this automatically, but that doesn't seem to be the case.
- Is there any documentation on how to set up distributed MPI with ipyparallel? I've googled around but found nothing obvious.
- Depending on the above, is ipython/ipyparallel only designed to handle local MPI connections, in order to avoid data transfers between the controller and engines?
EDIT
The answer to the first question seems to be that all MPI nodes have to be brought up at once. This is because:
- Dynamic Nodes in OpenMPI suggests that it is not possible to add nodes post-launch.
- MPI - Add/remove node while program is running suggests that child processes can be added through MPI_Comm_spawn. However, according to the MPI_Comm_spawn documentation:
MPI_Comm_spawn tries to start maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents.
A quick grep through the ipyparallel code suggests that this functionality isn't employed.
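For illustration, dynamic process creation in mpi4py looks roughly like this (child.py is a hypothetical script, and as noted above ipyparallel does not do any of this):
# parent.py: spawn 4 child processes and get an intercommunicator back
import sys
from mpi4py import MPI

intercomm = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=4)

# The children share their own MPI_COMM_WORLD (size 4), separate from the parent's
print("parent COMM_WORLD size: %s" % MPI.COMM_WORLD.Get_size())  # 1

# Parent and children can be merged into a single intracommunicator if needed
merged = intercomm.Merge()
print("merged size: %s" % merged.Get_size())  # 5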
A partial answer to the second question is that a machinefile needs to be used so that MPI knows which remote machines it can create processes on.
The implication here is that the setup on each remote host is homogeneous, as provided by a cluster system like Torque/SLURM. Otherwise, if one is trying to use arbitrary remote machines, one will have to do work to ensure that the environment mpiexec executes in is homogeneous.
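So my current assumption (untested) is that the controller is started as before, and all engines on all hosts are then launched in a single mpiexec using a machinefile, giving one COMM_WORLD of size 8. Something roughly like this, noting that the hostfile syntax and ipengine flags vary between MPI implementations and ipyparallel versions:
$ cat machinefile
hostA slots=4
hostB slots=4
$ ipcontroller --ip='*' --profile=mpi
$ mpiexec -n 8 -machinefile machinefile ipengine --profile=mpi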
A partial answer to the third question is no: ipyparallel can presumably work with remote MPI processes, but one needs to create one ipyparallel engine per MPI process.