1 vote

I have written an MPI based C-code that I use to perform numerical simulations in parallel. Due to some poor design on my part, I have built in some inherent MPI dependencies into the code (array structures, MPI-IO). This means that if I want to run my code in serial, I have to invoke

mpiexec -n 1 c_exe

Main problem: I use my C code within a Python workflow, simplified in the loop below.

import os 
import subprocess

homedir = os.getenv('PBS_O_WORKDIR')

nevents = 100
for ievent in range(nevents):

    perform_workflow_management()
    os.chdir(str(ievent))
    subprocess.call('mpiexec -n 1 c_exe', shell=True)
    os.chdir(homedir)

The Python workflow is primarily for management and makes calls to the C code which performs the numerically intensive work.

The tasks within the Python for loop are independent, consequently I would like to employ an embarrassingly parallel scheme to parallelize the loop over events. Benchmarks indicate that parallelizing the loop over events will be faster than a serial loop with parallel MPI calls. Furthermore, I am running this on a PBS-Torque cluster.

I am at a loss about how to do this effectively. The complication seems to arise from the MPI call to my C code and the assignment of multiple MPI tasks.

Things I have tried in some form

Wrappers to pbsdsh - incur problems with processor assignment.

MPMD with mpiexec - Theoretically does what I would like but fails because all processes seem to share MPI_COMM_WORLD. My C code establishes a cartesian topology for domain based parallelism; conflicts arise here.

Does anyone have suggestions on how I might deploy this in an embarrassingly parallel fashion? Ideally I would like to submit a job request

qsub -l nodes=N:ppn=1,walltime=XX:XX:XX go_python_job.bash

where N is the number of processors. On each process I would then like to be able to submit independent mpiexec calls to my C code.

I'm aware that part of the issue is down to design flaws but if I could find a solution without having to refactor large parts of code that would be advantageous.

1) How many "events" / time per event are we talking? It might be feasible to just launch a job per event (via Python), exposing the maximal amount of parallelism to the batch system. This way you get the best back-fill, but you may overload the batch system. 2) Have you tried just replacing all instances of MPI_COMM_WORLD with MPI_COMM_SELF except for the initial workflow management? – Zulan
So events are typically on the order of 100s and each C call can take ~10 minutes for a typical problem size (run in serial). I haven't tried MPI_COMM_SELF although it seems that could be a solution. If I ran MPMD as follows, mpiexec -n 1 a.out : -n 1 b.out, and replaced MPI_COMM_WORLD with MPI_COMM_SELF, would each instance only access the ranks on which it was launched? – MattMatt
Are multiple jobs per node possible on your cluster, or do you always block a full node even if you request only one core? – Zulan
I believe multiple jobs per node are possible. I tested a version with MPI_COMM_SELF and MPMD dispatch using mpiexec. Things seemed to work as desired: mpiexec assigns different nodes, and MPI_COMM_SELF acts to 'serialize' the MPI code. – MattMatt

1 Answer

2 votes

First of all, with any decent MPI implementation you don't have to use mpiexec to start a singleton MPI job - simply run the executable (MPI standard, §10.5.2 Singleton MPI_INIT). It works at least with Open MPI and the MPICH family.

Second, any decent DRM (distributed resource manager, a.k.a. batch queueing system) supports array jobs. Those are the DRM equivalent of SPMD - multiple jobs with the same job file.

To get an array job with Torque, pass qsub the -t from-to option, either on the command line or in the job script:

#PBS -t 1-100
...

Then, in your Python script obtain the value of the PBS_ARRAYID environment variable and use it to differentiate between the different instances:

import os 
import subprocess

homedir = os.getenv('PBS_O_WORKDIR')
ievent = os.getenv('PBS_ARRAYID')

perform_workflow_management()
os.chdir(ievent)
subprocess.call('./c_exe', shell=True)
os.chdir(homedir)

As already mentioned by @Zulan, this allows the job scheduler to better exploit the resources of the cluster via backfilling (if your Torque instance is paired with Maui or similar advanced scheduler).

The advantage of array jobs is that, although from your perspective they look and work (mostly) like a single job, the scheduler still sees them as separate jobs and schedules them individually.

One possible disadvantage of this approach is that if jobs are scheduled exclusively, i.e. no two jobs can share a single compute node, then utilisation will be quite low unless your cluster nodes each have just one single-core CPU (very unlikely nowadays).