In my Slurm cluster, when an srun or sbatch job requests resources from more than one node, the submission fails.
This Slurm cluster has 4 nodes, and each node has 4 GPUs.
I can run multiple 4-GPU jobs at the same time.
But I can't run a job that requests 5 or more GPUs.
The output below also shows that cise3 is down; that is a separate problem.
error message:
sbatch: error: Batch job submission failed: Requested node configuration is not available
start.sh:
#!/bin/bash
#SBATCH -o code20.out
#SBATCH --partition=cup-hpc
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --gres=gpu:5
#SBATCH --mem-per-cpu=100mb
source /home/slurm/tensorflow_prj/tf_gpu_cluster/bin/activate
python3 /nfs/code/code20.py
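A note on the request itself: per the sbatch documentation, a --gres count applies to each allocated node, not to the job as a whole. So --gres=gpu:5 asks every node for five GPUs, which a 4-GPU node cannot provide. Below is a minimal sketch of the arithmetic, assuming 4 GPUs per node as described in this post. The function name nodes_for_gpus and the variable names are hypothetical and not part of Slurm.

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm command): given a total GPU count and the
# number of GPUs per node, print the smallest node count that can hold them.
nodes_for_gpus() {
    local total_gpus="$1"
    local gpus_per_node="$2"
    # Ceiling division: a partially used node still counts as a whole node.
    echo $(( (total_gpus + gpus_per_node - 1) / gpus_per_node ))
}

gpus_per_node=4   # assumption taken from the post: each node has 4 GPUs
total_gpus=5      # total GPUs the job wants across all nodes

nodes=$(nodes_for_gpus "$total_gpus" "$gpus_per_node")
echo "nodes needed: $nodes"
echo "per-node request: --gres=gpu:N with N <= $gpus_per_node"
```

Under these assumptions, 5 GPUs need at least 2 nodes, for example --nodes=2 with --gres=gpu:4 (the per-node count is uniform, so this allocates 8 GPUs). Whether the job can use GPUs on several nodes depends on the application and on launching its tasks on each node, which a plain python3 call in the batch script does not do by itself.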
slurm.conf:
NodeName=cise1 NodeAddr=10.18.19.191 CPUs=40 RealMemory=94887 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise2 NodeAddr=10.18.19.107 CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise3 NodeAddr=10.18.19.47 CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
NodeName=cise4 NodeAddr=10.18.19.183 CPUs=40 RealMemory=94889 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4000:4
PartitionName=cup-hpc Nodes=cise[1-4] Default=YES MaxTime=INFINITE State=UP
gres.conf:
# Configure support for four GPUs (with MPS), plus bandwidth
AutoDetect=nvml
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
sinfo:
[root@localhost nfs]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cup-hpc* up infinite 1 down* cise3
cup-hpc* up infinite 3 idle cise[1-2,4]
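The asterisk after a state in sinfo output (down* here) marks a node that is not responding. Here is a small sketch for picking such nodes out of default-format sinfo output. The sample text is copied from the output above so the snippet runs without a cluster, and the helper name unresponsive_nodes is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read default-format sinfo lines on stdin and print the
# NODELIST of every row whose STATE ends in '*' (node not responding).
unresponsive_nodes() {
    # In sinfo's default layout, column 5 is STATE and column 6 is NODELIST.
    # The header row is skipped by checking its first field.
    awk '$1 != "PARTITION" && $5 ~ /\*$/ { print $6 }'
}

# Sample input copied from the sinfo output in this post.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cup-hpc* up infinite 1 down* cise3
cup-hpc* up infinite 3 idle cise[1-2,4]'

down=$(printf '%s\n' "$sample" | unresponsive_nodes)
echo "not responding: $down"
```

On a live cluster the same function could read from sinfo directly (sinfo | unresponsive_nodes). Once slurmd on the node is reachable again, scontrol update NodeName=cise3 State=RESUME is the usual way to return it to service.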
scontrol show nodes:
[root@localhost nfs]# scontrol show nodes
NodeName=cise1 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUTot=40 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx4000:4
NodeAddr=10.18.19.191 NodeHostName=cise1 Version=20.02.1
OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
RealMemory=94887 AllocMem=0 FreeMem=83727 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cup-hpc
BootTime=2020-04-13T08:34:13 SlurmdStartTime=2020-04-17T14:49:20
CfgTRES=cpu=40,mem=94887M,billing=40
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=cise2 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUTot=40 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx4000:4
NodeAddr=10.18.19.107 NodeHostName=cise2 Version=20.02.1
OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
RealMemory=94889 AllocMem=0 FreeMem=83405 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cup-hpc
BootTime=2020-04-13T08:33:51 SlurmdStartTime=2020-04-17T14:49:33
CfgTRES=cpu=40,mem=94889M,billing=40
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=cise3 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUTot=40 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx4000:4
NodeAddr=10.18.19.47 NodeHostName=cise3 Version=20.02.1
OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
RealMemory=94889 AllocMem=0 FreeMem=83456 Sockets=2 Boards=1
State=DOWN* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cup-hpc
BootTime=2020-04-13T08:31:48 SlurmdStartTime=2020-04-17T15:10:16
CfgTRES=cpu=40,mem=94889M,billing=40
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2020-04-17T15:17:58]
NodeName=cise4 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUTot=40 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx4000:4
NodeAddr=10.18.19.183 NodeHostName=cise4 Version=20.02.1
OS=Linux 4.18.0-80.el8.x86_64 #1 SMP Tue Jun 4 09:19:46 UTC 2019
RealMemory=94889 AllocMem=0 FreeMem=83432 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cup-hpc
BootTime=2020-04-13T08:36:40 SlurmdStartTime=2020-04-17T14:49:23
CfgTRES=cpu=40,mem=94889M,billing=40
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
What does scontrol show config | grep ^SelectType show? – damienfrancois
Can you run the scontrol show nodes command again, but while the four jobs are running, and look at AllocTRES and CPUAlloc? What are they? – damienfrancois