In my Slurm cluster, when an srun or sbatch job requests resources on more than one node, it is not submitted correctly.

This Slurm cluster has 4 nodes, each node has 4 GPUs.

I can run multiple jobs that each use 4 GPUs at the same time.

But I can't run a job that requests 5 or more GPUs.

The output below also shows that cise3 is down; that is a separate problem.

error message:

sbatch: error: Batch job submission failed: Requested node configuration is not available

start.sh:

#!/bin/bash
#SBATCH -o code20.out
#SBATCH --partition=cup-hpc
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --gres=gpu:5
#SBATCH --mem-per-cpu=100mb

source /home/slurm/tensorflow_prj/tf_gpu_cluster/bin/activate
python3 /nfs/code/code20.py

slurm.conf:

NodeName=cise1 NodeAddr=10.18.19.191 CPUs=40 RealMemory=94887 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise2 NodeAddr=10.18.19.107 CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise3 NodeAddr=10.18.19.47  CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise4 NodeAddr=10.18.19.183 CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
PartitionName=cup-hpc Nodes=cise[1-4] Default=YES MaxTime=INFINITE State=UP

gres.conf:

# Configure support for four GPUs
AutoDetect=nvml
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

sinfo:

[root@localhost nfs]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cup-hpc*     up   infinite      1  down* cise3
cup-hpc*     up   infinite      3   idle cise[1-2,4]

scontrol show nodes:

[root@localhost nfs]# scontrol show nodes
NodeName=cise1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4000:4
   NodeAddr=10.18.19.191 NodeHostName=cise1 Version=20.02.1
   OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
   RealMemory=94887 AllocMem=0 FreeMem=83727 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cup-hpc
   BootTime=2020-04-13T08:34:13 SlurmdStartTime=2020-04-17T14:49:20
   CfgTRES=cpu=40,mem=94887M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=cise2 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4000:4
   NodeAddr=10.18.19.107 NodeHostName=cise2 Version=20.02.1
   OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
   RealMemory=94889 AllocMem=0 FreeMem=83405 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cup-hpc
   BootTime=2020-04-13T08:33:51 SlurmdStartTime=2020-04-17T14:49:33
   CfgTRES=cpu=40,mem=94889M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=cise3 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4000:4
   NodeAddr=10.18.19.47 NodeHostName=cise3 Version=20.02.1
   OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
   RealMemory=94889 AllocMem=0 FreeMem=83456 Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cup-hpc
   BootTime=2020-04-13T08:31:48 SlurmdStartTime=2020-04-17T15:10:16
   CfgTRES=cpu=40,mem=94889M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2020-04-17T15:17:58]

NodeName=cise4 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx4000:4
   NodeAddr=10.18.19.183 NodeHostName=cise4 Version=20.02.1
   OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
   RealMemory=94889 AllocMem=0 FreeMem=83432 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cup-hpc
   BootTime=2020-04-13T08:36:40 SlurmdStartTime=2020-04-17T14:49:23
   CfgTRES=cpu=40,mem=94889M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comments:

damienfrancois: What is the value shown by scontrol show config | grep ^SelectType?

gaiaismus: SelectType = select/cons_res, SelectTypeParameters = CR_CORE

damienfrancois: OK, then run the scontrol show nodes command again while the four jobs are running and look at AllocTRES and CPUAlloc. What are they?

gaiaismus: I changed to #SBATCH --nodes=1 and submitted three identical jobs, because one of the nodes is down. The three nodes cise[1-2,4] show CPUAlloc=40 and AllocTRES=cpu=40, and the three jobs run on three different nodes.

damienfrancois: Then you are consuming all 40 CPUs of each node with the three jobs, and the fourth one cannot start even though some GPUs are available.
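
For anyone repeating this check, the per-node allocation can be inspected while the jobs are running with standard Slurm commands; the node name cise1 below is just one of the nodes from this cluster:

# Per-node CPU usage (Allocated/Idle/Other/Total) and configured GRES
sinfo -N -o "%n %C %G"

# Allocation counters for a single node, e.g. cise1
scontrol show node cise1 | grep -E "CPUAlloc|AllocTRES"

If GPU jobs are supposed to share a node with other work, lowering --cpus-per-task below 40 in the batch script would leave CPUs free for further jobs; the right value depends on what code20.py actually needs.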

1 Answer

The Slurm GRES specification is a per-node request. Your job is therefore asking for 3 nodes with 5 GPUs each, i.e. 15 GPUs in total, and no node in the cluster can provide 5 GPUs since each one has only 4.

Source: https://slurm.schedmd.com/gres.html#Running_Jobs
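
To illustrate, here is one way the submission script could be rewritten so that the per-node GPU request stays within the 4 GPUs each node actually has. This is only a sketch: it assumes the job can use 8 GPUs spread over 2 nodes instead of exactly 5, and it lowers --cpus-per-task (an example value, not a recommendation) so a single job no longer consumes all 40 CPUs of a node, which was the other blocker identified in the comments:

#!/bin/bash
#SBATCH -o code20.out
#SBATCH --partition=cup-hpc
#SBATCH --nodes=2                 # two nodes...
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20        # example value; leaves CPUs free for other jobs
#SBATCH --gres=gpu:4              # ...with 4 GPUs per node, i.e. 8 GPUs in total
#SBATCH --mem-per-cpu=100mb

source /home/slurm/tensorflow_prj/tf_gpu_cluster/bin/activate
python3 /nfs/code/code20.py

As a side note, Slurm 19.05 and later also offer a job-level --gpus=<count> option that requests a total number of GPUs across the whole allocation, but it requires SelectType=select/cons_tres; with the select/cons_res configuration shown in the comments, the per-node --gres form above is the one that applies.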