It might help to set a time limit for the Slurm job using the --time option, for instance a limit of 10 minutes like this:
srun --job-name="myJob" --ntasks=4 --nodes=2 --time=00:10:00 --label echo test
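The same limit can also be set inside a batch script rather than on the command line; a minimal sketch of an equivalent job script (the script itself is assumed here, not part of the question):

#!/bin/bash
#SBATCH --job-name=myJob
#SBATCH --ntasks=4
#SBATCH --nodes=2
#SBATCH --time=00:10:00   # 10-minute limit, same format as on the srun command line

srun --label echo test

Submitted with sbatch, e.g. sbatch myjob.sh.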
Without a time limit, Slurm uses the partition's default time limit. The issue is that this is sometimes set to infinity or to several days, which can delay the start of the job. To check the partitions' configured time limits, use:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
prod* up infinite 198 ....
gpu* up 4-00:00:00 70 ....
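Note that the TIMELIMIT column above is the partition's maximum run time. The default time limit applied when --time is omitted can be listed with a custom format string, for example:

$ sinfo -o "%P %l %L"

where %P prints the partition name, %l its maximum time limit and %L its default time limit.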
From the Slurm docs:
-t, --time=<time>
Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
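If a job does stay pending because its requested time exceeds the partition's limit, the reason is visible in the queue; a quick check, assuming the job ID is known:

$ squeue -j <jobid> -o "%i %T %r"
$ scontrol show job <jobid> | grep -i reason

The last column of squeue (%r) and the Reason field of scontrol both explain why the job has not started.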
sinfo -o%C shows the Allocated/Idle/Other(down)/Total number of CPUs. – damienfrancois
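For illustration, that command prints a single CPUS(A/I/O/T) column; the numbers below are made up:

$ sinfo -o%C
CPUS(A/I/O/T)
1200/3400/16/4616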