0 votes

I'm researching ways to distribute tasks across machines or a cluster. MPI seems to be the traditional way to do this, but the way it works also seems archaic: you write a single program that does just the task, detect in the code which node (process) you're running on, then send data to other processes to do the calculations. While there is plenty of information to be found on how to use the MPI API, I can't find a comprehensive description of how tasks are started on other machines. Reading between the lines, it seems that the 'task manager' (mpirun or mpiexec or similar) copies the whole executable (in a primitive way, just an scp or so) to the other machine, then runs it there. Sharing data then happens through network shares, nfs or cifs or similar.

So I'm wondering whether the specification also covers things like dependent libraries (what if my application depends on a shared library that the compute nodes don't have?) and cross-platform communication. The spec claims to be 'platform neutral', which is true I guess if you mean it can be implemented on all platforms, but there doesn't seem to be a common or reliable way to start a task from a Windows machine on a Linux cluster, for example.

What I want to do is have a long-running program that occasionally runs CPU-intensive tasks on many machines in small bursts. I also don't want to have to code to one cluster - I'm looking for a generic solution that can be run on real, thousands-of-nodes Linux clusters as well as on small LAN 'clusters' of workstations in a lab that are idling at night.

Is that possible with MPI? Or maybe I should ask, are there implementations of MPI that support this? Or should I forget about MPI and just implement my own task-specific solution?

Having multiple instances of the same program -- usually called the Single Program Multiple Data (SPMD) model -- is just one of the possible ways to use MPI. It also supports the Multiple Program Multiple Data (MPMD) model, with several different executables working together, as well as a client-server model, and it has provisions for starting other jobs on demand. But from your description it sounds more like you need a distributed resource/workload manager rather than MPI. – Hristo Iliev
You may consider HTCondor (research.cs.wisc.edu/htcondor/description.html), a workload management system for CPU-intensive computation. – terence hill

2 Answers

1 vote

EDIT

My answer is based on the common "classical cluster and supercomputer" implementations of MPI (MPICH and Open MPI).


So here is what you need to understand: MPI does not copy the executables to the other machines. To run an MPI program, you need to ensure that an executable of the same name is present on each machine at the same path; MPI does not do that for you. Typically, you compile the program on one machine and then distribute the resulting binary to all the other machines in the cluster, using rsync, scp, ftp, or similar.

When you start an MPI program using mpiexec or mpirun, the process manager launches the executable on the machines listed in the host file. The number of processes has to be specified by you using the -n parameter.
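For illustration, a host file (shown here in Open MPI's syntax; the hostnames and slot counts are made up) might look like this:

```
# hosts.txt -- one line per machine; "slots" is the number of processes
# that may be placed on that node (Open MPI syntax; MPICH would use node01:4)
node01 slots=4
node02 slots=4
```

You would then launch with something like `mpirun -np 8 --hostfile hosts.txt ./myprog`.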

MPI stands for Message Passing Interface, so, essentially, it uses the message-passing model, not a shared-memory model. Implementations typically pass the messages among the various processes over TCP (or over faster interconnects such as InfiniBand, when available).

As it is the executable that is distributed, having the same version of MPI installed on every node will suffice as far as MPI itself is concerned. Note, however, that if your program is dynamically linked against other shared libraries, those libraries must also be present on each node (or you can link statically). MPI supports processor, network and runtime-environment heterogeneity.

As Hristo explained, you can have either SPMD or MPMD; it all depends on your needs. Since you have a long-running program that occasionally runs CPU-intensive tasks on many machines in small bursts, you could start a single MPI program which spawns multiple children to process the CPU-intensive bursts on other machines in the cluster.

MPI does provide a generic solution that can run on real, thousands-of-nodes Linux clusters as well as on small LAN 'clusters' of workstations. Your executable file remains the same; only the host file, in which you specify the hostnames and the number of processors per node, varies.

Now, regarding scheduling the tasks at night: MPI won't do that for you. You could schedule these MPI jobs for the night, or any other time of day, using an ordinary task scheduler (e.g. cron), which is not specific to MPI.
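For example, a crontab entry along these lines (the paths and the host file are hypothetical; Open MPI's `mpirun --hostfile` flag is shown) would launch the job at 2 a.m. every night:

```
# m h dom mon dow  command
0 2 * * * mpirun -np 8 --hostfile /home/user/hosts.txt /opt/myprog/myprog
```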

1 vote

You're asking about a lot of things that are external to the MPI specification. Look at it this way: mpirun makes an ssh connection to each of the available machines and uses it to start up the named executable. The executable can have the same name on all machines, but it doesn't need to. That means the question of dependent libraries is up to you: mpirun only issues the command to start the executable. If you want to run on a heterogeneous Linux/Windows setup, you have to compile the program separately for each type of machine.

A long-running process that occasionally fires off bursts of activity on remote nodes is not the model for which MPI was initially developed. Originally MPI used the "Single Program Multiple Data" model, and all processes had essentially the same lifetime. Since version 2, however, MPI has dynamic process management, so it is possible to start one MPI manager process that dynamically creates workers, which go away again after a while.