
I am using a computer cluster with 20 nodes, and each node has 16 CPUs. I tried to submit 1000 jobs to all nodes with the command "sbatch XX.sbatch". What I want is for 320 jobs to run simultaneously, i.e., 16 jobs per node, or 1 job per CPU.

When the XX.sbatch file contains the following parameters:

#!/bin/bash
# Interpreter declaration
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J job_XX

./example.sh

I noticed only 1 job is running on each node.

Then I tried

#!/bin/bash
# Interpreter declaration
#SBATCH -N 20
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J job_XX

./example.sh

I noticed that only 1 job is running across the 20 nodes, i.e., 1 job occupying all 20 nodes.

Then I tried

#!/bin/bash
# Interpreter declaration
#SBATCH -N 20
#SBATCH -n 320
#SBATCH -c 1
#SBATCH --ntasks-per-node=16
#SBATCH -J job_XX

./example.sh

Still, 1 job is using all 20 nodes.

Does anyone know how to fix it? Thanks.


1 Answer


Well, if you want more than one job, you need to submit more than one job. If you only call sbatch XX.sbatch once, only one job will be created (not quite correct, see below).

If you want to create 1000 jobs, you could simply use a for loop to call sbatch 1000 times:

for i in {1..1000}; do
    sbatch XX.sbatch
done

This would create 1000 jobs with 1 core each (taking your first job script as an example), and they would fill up all 320 available job slots. But calling sbatch in a for loop like that puts unnecessary load on the scheduler. There is a better way to submit a number of similar jobs: job arrays.

A job array submits a single job script any number of times at once. Inside the job script, you can then use environment variables such as $SLURM_ARRAY_TASK_ID to control your script, so the tasks do not all do exactly the same thing.
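For instance, here is a minimal sketch of how a job script could use the task ID to give every task different work (the inputs.txt file and its contents are hypothetical):

# Hypothetical: inputs.txt lists one input path per line.
# Each array task picks the line matching its own index.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
./example.sh "$INPUT"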

Take your first jobscript for example:

#!/bin/bash
# Interpreter declaration
#SBATCH -N 1              # 1 node per array task
#SBATCH -n 1              # 1 task
#SBATCH -c 1              # 1 CPU per task
#SBATCH -J job_XX
#SBATCH --array=1-1000    # submit 1000 array tasks

#Do something with the env vars e.g. use them as parameters for your script
./example.sh $SLURM_ARRAY_TASK_ID

Submitting this with sbatch XX.sbatch creates a single array job with 1000 tasks, each using one core, so they fill up all 320 available cores while the remaining tasks wait in the queue until slots free up.
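As a side note: if you ever need to cap how many of the array tasks run at the same time, Slurm accepts a % limit on the array range. For example, this would keep at most 320 of the 1000 tasks running at once:

#SBATCH --array=1-1000%320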