2
votes

I'm trying to use our cluster but I'm having issues. I tried allocating some resources with:

salloc -N 1 --ntasks-per-node=5 bash

but it keeps waiting on:

salloc: Pending job allocation ...

salloc: job ... queued and waiting for resources

or when I try:

srun -N1 -l echo test

it lingers in the waiting queue!

Am I making a mistake, or is there something wrong with our cluster?

1
Not enough info to answer. Is the cluster empty, or are there other jobs using it? Are there free resources for you? Are there higher-priority jobs than yours waiting in the queue? Is backfilling enabled, or does it work as a FIFO? – Poshi
@Poshi these are exactly the kind of details I was looking for. How can I check whether the cluster is empty, or how many resources are allocatable? How can I find the priority of the jobs? What are backfilling and FIFO? – Foad
Check the SLURM documentation; all this information is spread across the different commands SLURM offers. In general, if SLURM makes you wait, it is not a system error (maybe a user error, if you are asking for something that does not exist, but in that case the state of the job is not Priority or Resources). – Poshi
A quick way is to run sinfo -o%C; it shows the Allocated/Idle/Other(down)/Total number of CPUs. – damienfrancois
Maybe you are asking for all cores of a node and all nodes are running 1-core jobs. Or maybe you are asking for more memory per core than is available on the machine. Or maybe there is a higher-priority job in front of you asking for 700 cores, and you won't start until that job gets its share... There can be many reasons for your job to be waiting. Before complaining, you should check that all requirements are met and the job still does not start. I bet there's some resource that you need that is not available. – Poshi
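
As a starting point for the checks suggested in the comments, the following commands give a quick picture of the cluster and of your own jobs. This is only a rough sketch: output formats vary a little between Slurm versions, and sprio is only meaningful when the multifactor priority plugin is in use.

$ sinfo -o%C                                   # Allocated/Idle/Other(down)/Total CPUs
$ squeue -u $USER -o "%.18i %.9P %.8T %.12r"   # your jobs with their state and pending reason
$ sprio -l                                     # priority factors of pending jobs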

1 Answer

1
votes

It might help to set a time limit for the Slurm job using the --time option, for instance a limit of 10 minutes like this:

srun --job-name="myJob" --ntasks=4 --nodes=2 --time=00:10:00 --label echo test

Without a time limit, Slurm will use the partition's default time limit. The issue is that this is sometimes set to infinity or to several days, which can delay the start of the job. To check the partitions' time limits, use:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
prod*        up   infinite    198  ....
gpu*         up 4-00:00:00     70  ....
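
To see a partition's default and maximum time limits in one place, scontrol can be used as well. A sketch, assuming "prod" is the partition name taken from the example output above; the grep simply extracts the DefaultTime= and MaxTime= fields from the partition record:

$ scontrol show partition prod | grep -Eo "(DefaultTime|MaxTime)=[^ ]+"

If DefaultTime is NONE, jobs that do not request a time limit fall back to the partition's MaxTime.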

From the Slurm docs:

-t, --time=<time> Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
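
When a job does stay pending, its Reason field shows why Slurm is holding it (for example Resources, Priority, or PartitionTimeLimit when the requested time exceeds the partition's limit as described above). A minimal check, with <jobid> as a placeholder for the actual job ID:

$ squeue -j <jobid> -o "%.8T %.20r"        # job state and pending reason
$ scontrol show job <jobid> | grep Reason  # line containing JobState=... Reason=...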