4 votes

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core MATLAB process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges, and each MPI call can take up to 20 minutes.

I initially (naively) submitted each MPI call as a separate SLURM job, but the queue time made that slower than running each job locally in serial. I am now trying to figure out a way to submit one N-node job that continuously runs MPI tasks so the allocated resources stay in use. The MATLAB process would manage this job through text-file flags.

Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:

#!/bin/bash

#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while <"KeepRunning.txt” == 1>
do
  for i in {0..40}
  do
    if <“RunJob_i.txt” == 1>
    then
      mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
    fi
  done
done

wait

This approach doesn't work (it just crashes), and I don't know exactly why (probably oversubscription of resources?). Some of my peers have suggested using GNU parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. That would be a huge waste of resources, since a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem: start a batch of five 8-core jobs and have four of them crash immediately; now 32 cores sit idle for up to 20 minutes while they wait for the fifth job to finish.
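For context, the pattern they suggested would presumably look something like the following (the joblist file name and the -j limit are placeholders; I don't have this working):

# Hypothetical sketch: one srun command line per evaluation in joblist.txt;
# GNU parallel keeps at most 4 of them running inside the allocation at a time.
parallel -j 4 'srun -n 8 --exclusive {}' :::: joblist.txt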

Since the optimization will likely require upwards of 5000 MPI calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice on how I could run a constant stream of MPI calls within a large SLURM job? I would really appreciate any help.

1
Unless a given MPI run is shorter than a few seconds, an option is to create a SLURM reservation and then simply submit your jobs inside this reservation (once the reservation is active, your jobs will not spend any time in the queue). - Gilles Gouaillardet
@GillesGouaillardet That is an option I wasn't aware of, thanks! It looks like I might not have permission to do that on our cluster, but I'll try to get it sorted out tomorrow. As an alternative, I think GNU sem might work too. Still new to HPC. - user8176985
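
For reference, a reservation along those lines would be created and used roughly like this (it typically requires admin or operator privileges; the reservation name, duration, and job script here are placeholders):

# Create a reservation covering 2 nodes for 4 hours, then submit jobs into it
scontrol create reservation ReservationName=pso_res StartTime=now Duration=04:00:00 Users=$USER NodeCnt=2
sbatch --reservation=pso_res my_mpi_job.sh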

1 Answer

1 vote

A couple of things: first, under SLURM you should be using srun, not mpirun. Second, the pseudo-code you provided launches an unbounded number of jobs without waiting for any of them to finish. Put the wait inside the while loop, so that you launch one set of jobs, wait for them to finish, re-evaluate the condition and, if needed, launch the next set:

#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of tasks (MPI ranks) in this job
#SBATCH -c 1 # processor cores (cpus) allocated per task

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while <"KeepRunning.txt” == 1>
do
  for i in {0..40}
  do
    if <“RunJob_i.txt” == 1>
    then
      srun -np 8 --exclusive <job_i> &
    fi
  done
  wait
  <Update "KeepRunning.txt”>

done

Take care also to distinguish tasks from cores: -n says how many tasks will be used, while -c says how many CPUs will be allocated for each task.

The code I wrote launches up to 41 jobs in the background (indices 0 to 40 inclusive), but they will only start once resources are available (--exclusive); until then they wait while the cores are occupied. Each job will use 8 CPUs. Then you wait for them all to finish, and I assume you will update KeepRunning.txt after that round.
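
Purely as a sketch of that update step (Converged.txt here is a made-up signal file; in your setup the MATLAB process presumably writes these flags):

# Hypothetical update: stop the outer while loop once the driver signals convergence
if [ -f Converged.txt ]; then
  echo 0 > KeepRunning.txt
fi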