I am trying to run a socket-programming application on a cluster that uses SLURM for node allocation. I used the following SLURM script:

#!/bin/bash
#SBATCH --job-name="abcd"
#SBATCH --ntasks=2
#SBATCH --nodes=2-2
#SBATCH --cpus-per-task=128
#SBATCH --partition=knl
./a.out

When I submit this script with sbatch, I get the error "sbatch: error: Batch job submission failed: Requested node configuration is not available".

However, I do see nodes that satisfy this configuration. The scontrol output for two of them is shown below:

NodeName=compute140 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=20 CPUErr=0 CPUTot=256 CPULoad=20.01
   AvailableFeatures=knl
   ActiveFeatures=knl
   Gres=(null)
   NodeAddr=compute140 NodeHostName=compute140 Version=16.05
   OS=Linux RealMemory=96000 AllocMem=81920 FreeMem=102580 Sockets=1 Boards=1
   MemSpecLimit=1024
   State=MIXED ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2018-06-04T12:41:22 SlurmdStartTime=2018-06-04T12:47:01
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute141 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=20 CPUErr=0 CPUTot=256 CPULoad=20.01
   AvailableFeatures=knl
   ActiveFeatures=knl
   Gres=(null)
   NodeAddr=compute141 NodeHostName=compute141 Version=16.05
   OS=Linux RealMemory=96000 AllocMem=81920 FreeMem=87441 Sockets=1 Boards=1
   MemSpecLimit=1024
   State=MIXED ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2018-06-04T12:46:37 SlurmdStartTime=2018-06-04T12:52:11
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I am not sure why I am getting this error when SLURM should be able to allocate the requested configuration.

I want to run a client-server application on two different knl nodes, with each task multithreaded at 128 threads per task.

Please help; I have tried several things, but nothing works.

What is the value of DefMemPerCPU in the configuration? (damienfrancois)
@damienfrancois The value of DefMemPerCPU is 4096. (Mayank Jain)

1 Answer

You do not explicitly specify a memory requirement, so the default per-CPU value (DefMemPerCPU) applies. If that default is larger than RealMemory divided by the number of CPUs you request on a node, in your case 96000 MB / 128 = 750 MB, then the task cannot fit on a single node.
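
As a rough check, the arithmetic can be reproduced in a few lines of bash. The numbers come from the scontrol output in the question and the DefMemPerCPU value reported in the comments; the variable names are only illustrative.

```shell
#!/bin/bash
# Values taken from the question and its comments.
real_memory_mb=96000      # RealMemory reported by scontrol for each node
cpus_per_task=128         # --cpus-per-task in the batch script
def_mem_per_cpu_mb=4096   # DefMemPerCPU reported in the comments

# Memory implicitly requested on a node when no memory option is given.
requested_mb=$((cpus_per_task * def_mem_per_cpu_mb))

# Largest per-CPU value (integer MB) for which 128 CPUs still fit on one node.
max_per_cpu_mb=$((real_memory_mb / cpus_per_task))

echo "implicit request per node: ${requested_mb} MB"
echo "largest per-CPU value that fits: ${max_per_cpu_mb} MB"

if [ "$requested_mb" -gt "$real_memory_mb" ]; then
    echo "request exceeds RealMemory"
fi
```

This ignores any memory the site reserves for the system, so the real limit may be somewhat lower than RealMemory.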

So with a default of 4096 MB per CPU, one task per node, and 128 CPUs per task, you are effectively requesting 128 × 4096 MB = 524288 MB (512 GB) of RAM per node, which your cluster cannot offer: each node has only 96000 MB.
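
One way around this, as a sketch, is to state the memory requirement explicitly so that the default no longer applies. The value 700 below is an assumption, chosen only so that 128 × 700 MB = 89600 MB stays under the node's RealMemory of 96000 MB; adjust it to what the program actually needs.

```shell
#!/bin/bash
#SBATCH --job-name="abcd"
#SBATCH --ntasks=2
#SBATCH --nodes=2-2
#SBATCH --cpus-per-task=128
#SBATCH --partition=knl
# Assumed value: override DefMemPerCPU so that 128 CPUs fit in one node's memory.
#SBATCH --mem-per-cpu=700
./a.out
```

With this change the request is one that the nodes can satisfy in principle. The job may still wait in the queue if the nodes do not have that much memory free at submission time.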