My question is rather specific.
For more than a week, I have been trying to submit thousands of single-threaded jobs for a scientific experiment using sbatch and srun.
These jobs may take different amounts of time to finish, and some may even be aborted because they exceed the memory limit. Both behaviors are fine and my evaluation deals with them.
The problem I am facing is that some of the jobs are never started, even though they have been submitted.
My sbatch script looks like this:
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
  srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
  wait 5s
done
Now, my error log shows the following message:
srun: Job 1846955 step creation temporarily disabled, retrying
1) What does 'step creation temporarily disabled' mean? Does it mean that all CPUs are busy and the job step is skipped, or is the step retried later when resources become free?
2) Why are some of my jobs not carried out, and how can I fix it? Am I using the correct parameters for srun?
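
To make the "never started" symptom easier to pin down, here is a minimal sketch of how the missing task indices could be identified, assuming each task leaves a marker file when it starts. The names `run_task`, `marker_dir` and `total` are hypothetical, and the rule that skips every 7th index only simulates steps that never start so that the sketch runs without Slurm. In the real script, `run_task` would be replaced by the srun line.

```shell
#!/bin/bash
# Sketch only: run_task is a hypothetical stand-in for the real
# "srun ... ${mybinary} $i" line, so this runs without a cluster.

marker_dir=$(mktemp -d)   # assumption: one marker file per started task
total=20                  # small count for illustration; the real run uses 500

run_task() {
  # $1 is the task index. Every 7th index is skipped on purpose to
  # simulate a step that is never started.
  if [ $(( $1 % 7 )) -eq 0 ]; then
    return 0
  fi
  touch "${marker_dir}/started.$1"
}

# Launch all tasks in the background, as in the batch script.
i=1
while [ "$i" -le "$total" ]; do
  run_task "$i" &
  i=$((i + 1))
done

# wait without arguments blocks until every background task has exited.
wait

# Collect the indices that left no marker file.
missing=""
i=1
while [ "$i" -le "$total" ]; do
  [ -e "${marker_dir}/started.$i" ] || missing="${missing} $i"
  i=$((i + 1))
done

echo "never started:${missing}"   # prints "never started: 7 14"
rm -rf "$marker_dir"
```

With a list like this, the indices that never ran can be compared against the srun messages in the error log.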
