4
votes

I am running a set of many parallel jobs in Slurm (around 1000), and each of these has to be assigned to one CPU. Reading the Slurm documentation I found this:

Best Practices, Large Job Counts

Consider putting related work into a single Slurm job with multiple job steps both for performance reasons and ease of management. Each Slurm job can contain a multitude of job steps and the overhead in Slurm for managing job steps is much lower than that of individual jobs.

Job arrays are an efficient mechanism of managing a collection of batch jobs with identical resource requirements. Most Slurm commands can manage job arrays either as individual elements (tasks) or as a single entity (e.g. delete an entire job array in a single command).

This seems to imply that a single job with many job steps (e.g. one batch script with many srun calls, each requesting the same resources) performs better than a job array. My issue, though, is that I don't want to block resources for other people: if I run one job with 1000 srun calls, the job will hold a large number of processors for its whole duration once it starts running, whereas if I run a job array of 1000 jobs, those jobs will only use processors as they become available in the queue, which I believe is more flexible.
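For concreteness, the two approaches might be sketched like this (the script names, `task.sh`, and the resource counts are assumptions for illustration, not part of the documentation quoted above):

```shell
#!/bin/bash
# Approach 1 (sketch): one job holding 100 CPUs, running 1000 job steps.
#SBATCH --ntasks=100
for i in $(seq 1 1000); do
    # each srun launches one job step on one of the allocated CPUs;
    # --exclusive here means the step gets CPUs not used by other steps
    srun --ntasks=1 --exclusive ./task.sh "$i" &
done
wait
```

```shell
#!/bin/bash
# Approach 2 (sketch): a job array of 1000 independent one-CPU jobs.
#SBATCH --ntasks=1
#SBATCH --array=1-1000
./task.sh "$SLURM_ARRAY_TASK_ID"
```

In the first case the 100-CPU allocation is held until all 1000 steps finish; in the second, each array task is scheduled independently as CPUs free up.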

My question is: Is the overhead of running a job array over job steps significant enough for me to worry about? Is there any alternative if the overhead is large? How do people usually deal with this sort of situation? I've seen people use GNU parallel with Slurm in some circumstances; does it provide any advantage? Is this a possible use case?


1 Answer

4
votes

Is the overhead of running a job array over job steps significant enough for me to worry about this?

It all depends on the duration of one step. Depending on the cluster, scheduling and starting a job can take a few dozen seconds (preparing the environment, creating temporary directories, doing some cleaning, and perhaps sanity or health checks). So if a step takes less than a couple of minutes, you definitely need to 'pack' them; otherwise you spend as much time organising the computation as computing.

By contrast, if a step is close to the maximum wall time allowed on the cluster, you'd better use job arrays.

Note you can also go in-between and submit an array of size 10 with jobs running 100 steps.
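The in-between option could be sketched as follows (the script is illustrative; `task.sh` and the 10×100 split are assumptions):

```shell
#!/bin/bash
# Hybrid sketch: an array of 10 jobs, each running 100 job steps in sequence.
#SBATCH --ntasks=1
#SBATCH --array=0-9
# Array task k handles work items k*100+1 .. k*100+100.
for i in $(seq 1 100); do
    srun --ntasks=1 ./task.sh $(( SLURM_ARRAY_TASK_ID * 100 + i ))
done
```

This keeps the per-job scheduling overhead to 10 jobs while each job only holds one CPU, so it blocks far fewer resources than a single 1000-step job.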

Is there any alternative if the overhead is large?

You can use a meta-scheduler and a technique sometimes called glide-in, where you submit a job that does nothing other than listen for a workflow organiser to feed it with tasks. See for instance FireWorks.

How do people usually deal with this sort of situations?

They ask the system administrators for guidance on what they prefer to manage. Sometimes having many small jobs increases the total utilisation of the cluster, which is good; sometimes having many small jobs degrades the performance of the scheduler.

I've seen people using GNU parallel with slurm in some circumstances, does it provide any advantage?

GNU Parallel has very powerful tools for generating the job steps, for instance computing all possible pairwise combinations of two parameters, advanced globbing on files, etc.

It also allows replacing a few lines of Bash with a single one to handle the starting of all steps.
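As a sketch of that single line (the 40×25 parameter grid, the `--ntasks=100` allocation, and `task.sh` are assumptions for illustration):

```shell
#!/bin/bash
# Sketch: GNU Parallel dispatching job steps inside one Slurm allocation.
#SBATCH --ntasks=100
# -j limits the number of concurrent steps to the allocation size;
# {1} and {2} expand to every pairwise combination of the two input lists.
parallel -j "$SLURM_NTASKS" \
    srun --ntasks=1 --exclusive ./task.sh {1} {2} \
    ::: {1..40} ::: {1..25}
```

The `:::` input sources replace what would otherwise be nested Bash loops plus bookkeeping to keep at most 100 steps in flight.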

Is this a possible use case?

Yes, you could use it, but it will not help you decide on your primary question.