Hello friendly people,

my question is rather specific.

For more than a week, I have been trying to submit thousands of single-threaded jobs for a scientific experiment using sbatch and srun.

The problem is that these jobs may take different amounts of time to finish and some may even be aborted as they exceed the memory limit. Both behaviors are fine and my evaluation deals with it.

However, some of the jobs are never started, even though they have been submitted.

My sbatch script looks like this:

#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000

for i in {1..500}
do

   srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &   
   wait 5s

done

Now, my error log shows the following message:

srun: Job 1846955 step creation temporarily disabled, retrying

1) What does 'step creation temporarily disabled' mean? Are all CPUs busy and the job is skipped, or is it started later when resources are free?

2) Why are some of my jobs not carried out, and how can I fix it? Am I using the correct parameters for srun?

Thanks for your help!


1 Answer


srun: Job 1846955 step creation temporarily disabled, retrying

This is normal. You reserve 4 x 12 = 48 CPUs but start 500 instances of srun. Only 48 instances can run at a time; each of the others prints that message and keeps retrying. Whenever a running instance finishes, a pending instance starts.

wait 5s

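The wait command waits for processes to finish, not for a certain amount of time. To pause for a fixed time, use the sleep command. In bash, wait 5s simply fails, because 5s is neither a process ID nor a job spec, so the loop launches all 500 srun instances without pausing. A plain wait must come at the end of the script. Otherwise, the batch script could exit, and the job could end, before all srun instances have finished, which would explain why some of your jobs were never carried out.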
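The difference between sleep and wait can be tried out without a cluster. The following is only a minimal sketch: the background sleep commands are stand-ins for the srun steps, and count_alive is a helper name made up for this illustration.

```bash
#!/usr/bin/env bash
# Stand-ins for the srun steps: three background tasks of different lengths.
pids=()
for d in 1 0.5 0.8; do
    sleep "$d" &
    pids+=("$!")
done

# Hypothetical helper: count how many recorded PIDs still exist.
# kill -0 sends no signal; it only checks that the process is there.
count_alive() {
    local n=0 pid
    for pid in "${pids[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

sleep 0.1                  # fixed pause: it does not track the tasks
before=$(count_alive)

wait                       # blocks until every background task has exited
after=$(count_alive)

echo "alive after sleep: $before, alive after wait: $after"
```

After the short sleep the tasks should all still be alive, and after wait none of them should be. If the script ended right after the sleep, the tasks would be left behind, which is what happens to pending srun instances when the batch script exits early.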

So the script should look like this:

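#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000

for i in {1..500}
do
   # Launch each job step in the background; at most 48 run at the same time.
   srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done

# Block until every background srun instance has finished.
wait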
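
If the retry messages clutter the error log, one optional alternative (not required for correctness, since srun already queues the pending steps) is to throttle the launches in the shell itself so that no more than 48 instances exist at once. This is only a sketch under assumptions: run_task is a hypothetical stand-in for the srun line, the small numbers are chosen so it runs quickly anywhere, and wait -n needs bash 4.3 or newer.

```bash
#!/usr/bin/env bash
# Sketch: keep at most MAX_PARALLEL background tasks alive at once.
MAX_PARALLEL=4             # would be 48 for 4 nodes x 12 tasks
TOTAL=10                   # would be 500 in the job script

# Hypothetical stand-in; in the job script this would be the srun line.
run_task() { sleep 0.2; }

running=0                  # tasks launched and not yet accounted as finished
peak=0                     # highest value of $running, for illustration only
for ((i = 1; i <= TOTAL; i++)); do
    if (( running >= MAX_PARALLEL )); then
        wait -n            # returns as soon as any one background task exits
        running=$((running - 1))
    fi
    run_task "$i" &
    running=$((running + 1))
    if (( running > peak )); then
        peak=$running
    fi
done

wait                       # still required: wait for the last tasks to finish
echo "launched $TOTAL tasks, peak concurrency $peak"
```

The counter only ever errs on the side of launching too few tasks, so the limit is not exceeded. The final wait is still needed for the same reason as in the script above.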