I'm struggling to understand how to set up a distributed MPI cluster with ipython/ipyparallel. I don't have a strong MPI background.
I've followed the instructions in the ipyparallel docs (Using ipcluster in mpiexec/mpirun mode), and this works fine for distributing computation on a single node. So I create an mpi profile and configure it as per those instructions:
$ ipython profile create --parallel --profile=mpi
$ vim ~/.ipython/profile_mpi/ipcluster_config.py
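For reference, the only change I make in ipcluster_config.py is to select the MPI engine launcher, roughly like this (a sketch from memory; the exact launcher name may differ between ipyparallel versions):
# ~/.ipython/profile_mpi/ipcluster_config.py
# Launch the engines via mpiexec/mpirun instead of plain subprocesses
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'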
Then on host A I start a controller and 4 MPI engines:
$ ipcontroller --ip='*' --profile=mpi
$ ipcluster engines --n=4 --profile=mpi
Running the following snippet:
from ipyparallel import Client
from mpi4py import MPI

c = Client(profile='mpi')
view = c[:]

print("Client MPI.COMM_WORLD.Get_size()=%s" % MPI.COMM_WORLD.Get_size())
print("Client engine ids %s" % c.ids)

def _get_rank():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank()

def _get_size():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_size()

print("Remote COMM_WORLD ranks %s" % view.apply_sync(_get_rank))
print("Remote COMM_WORLD size %s" % view.apply_sync(_get_size))
yields
Client MPI.COMM_WORLD.Get_size()=1
Client engine ids [0, 1, 2, 3]
Remote COMM_WORLD ranks [1, 0, 2, 3]
Remote COMM_WORLD size [4, 4, 4, 4]
Then on host B I start 4 more MPI engines and run the snippet again, which yields
Client MPI.COMM_WORLD.Get_size()=1
Client engine ids [0, 1, 2, 3, 4, 5, 6, 7]
Remote COMM_WORLD ranks [1, 0, 2, 3, 2, 3, 0, 1]
Remote COMM_WORLD size [4, 4, 4, 4, 4, 4, 4, 4]
It seems that the engines from each ipcluster command are grouped into separate communicators of size 4, hence the duplicate ranks. And there is only one MPI process for the client.
Questions:
- ipython/ipyparallel doesn't seem to be making the MPI connections between hosts. Should ipyparallel be handling the MPI setup, or should I, as a user, be creating the MPI setup myself, as "IPython MPI with a Machinefile" suggests? My assumption was that ipyparallel would handle this automatically, but that doesn't seem to be the case.
- Is there any documentation on how to set up distributed MPI with ipyparallel? I've googled around but found nothing obvious.
- Depending on the above, is ipython/ipyparallel only designed to handle local MPI connections, in order to avoid data transfers between the controller and engines?
EDIT
The answer to the first question seems to be that all MPI nodes have to be brought up at once. This is because:
- Dynamic Nodes in OpenMPI suggests that it is not possible to add nodes post-launch.
- MPI - Add/remove node while program is running suggests that child processes can be added through MPI_Comm_spawn. However, according to the MPI_Comm_spawn documentation:
MPI_Comm_spawn tries to start maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents.
A quick grep through the ipyparallel code suggests that this functionality isn't employed.
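For illustration, dynamic process creation in mpi4py looks roughly like this (child.py is a hypothetical script, and as noted above ipyparallel does not do any of this):
# parent.py: spawn 4 child processes and get an intercommunicator back
import sys
from mpi4py import MPI

intercomm = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=4)

# The children share their own MPI_COMM_WORLD (size 4), separate from the parent's
print("parent COMM_WORLD size: %s" % MPI.COMM_WORLD.Get_size())  # 1

# Parent and children can be merged into a single intracommunicator if needed
merged = intercomm.Merge()
print("merged size: %s" % merged.Get_size())  # 5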
A partial answer to the second question is that a machinefile needs to be used so that MPI knows which remote machines it can create processes on.
The implication here is that the setup on each remote host is homogeneous, as provided by a cluster system like Torque/SLURM. Otherwise, if one is trying to use arbitrary remote machines, one will have to do work to ensure that the environment mpiexec executes in is homogeneous.
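So my current assumption (untested) is that the controller is started as before, and all engines on all hosts are then launched in a single mpiexec using a machinefile, giving one COMM_WORLD of size 8. Something roughly like this, noting that the hostfile syntax and ipengine flags vary between MPI implementations and ipyparallel versions:
$ cat machinefile
hostA slots=4
hostB slots=4
$ ipcontroller --ip='*' --profile=mpi
$ mpiexec -n 8 -machinefile machinefile ipengine --profile=mpi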
A partial answer to the third question is no: ipyparallel can presumably work with remote MPI processes, but one needs to create one ipyparallel engine per MPI process.