2
votes

I would like to request two nodes in the same cluster, and both nodes must be allocated before the script begins.

In the Slurm script, is there a way to launch job-A on one node and job-B on the second node, either simultaneously or with a small delay?

Do you have suggestions on how this could be done? Here is my current script:

#!/bin/bash
#SBATCH --job-name="test"
#SBATCH -D .
#SBATCH --output=./logs_%j.out
#SBATCH --error=./logs_%j.err
#SBATCH --nodelist=nodes[19,23]
#SBATCH --time=120:30:00
#SBATCH --partition=AWESOME
#SBATCH --wait-all-nodes=1

#launched on Node 1
ifconfig > node19.txt

#Launched on Node2
ifconfig >> node23.txt

In other words, if I request two nodes, how do I run two different jobs on them simultaneously? Could they be deployed as job steps, as described in the last part of the srun manual (MULTIPLE PROGRAM CONFIGURATION)? In that context, "-l" isn't defined.


2 Answers

1
votes

I'm assuming that by job-A and job-B you are referring to the two commands in the script. I'm also assuming that the setup you show works, except that the commands are not started on the intended nodes and run one after the other. (I have the feeling that the requested resources are not fully specified and that some information is missing, but if SLURM does not complain, it is fine.) You should also be careful with the redirected output: if the first job opens its redirection after the second job has written, it will truncate the file and you will lose the second job's output.

To start them on the appropriate nodes, run the commands through srun:

#!/bin/bash
#SBATCH --job-name="test"
#SBATCH -D .
#SBATCH --output=./logs_%j.out
#SBATCH --error=./logs_%j.err
#SBATCH --nodelist=nodes[19,23]
#SBATCH --time=120:30:00
#SBATCH --partition=AWESOME
#SBATCH --wait-all-nodes=1

#launched on Node 1
srun --nodes=1 echo 'hello from node 1' > test.txt &

#Launched on Node2
srun --nodes=1 echo 'hello from node 2' >> test.txt &

# Without this, the batch script exits at once and the background steps are killed
wait
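
The redirection concern raised in this answer can be sidestepped by giving each step its own output file and merging them once both have finished. The following is only a sketch under assumptions: the file names step1.txt and step2.txt are made up for illustration, and the check on SLURM_JOB_ID is there solely so the script also runs outside a Slurm allocation, where the commands are executed directly instead of through srun.

```shell
#!/bin/bash
# Use srun only inside a Slurm allocation; otherwise run the commands
# directly so the sketch can be tried on any machine.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    launcher=(srun --nodes=1)
else
    launcher=()
fi

# Each step writes to its own (hypothetical) file, so neither redirection
# can truncate the other's output, whichever one opens first.
"${launcher[@]}" echo 'hello from node 1' > step1.txt &
"${launcher[@]}" echo 'hello from node 2' > step2.txt &

# Block until both background steps finish, then merge in a fixed order.
wait
cat step1.txt step2.txt > test.txt
```

Merging after wait also makes the order of lines in test.txt independent of which step finishes first.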
1
votes

That did the job! The files ./com_19.bash and ./com_23.bash serve as the executables for the two job steps.

#!/bin/bash
#SBATCH --job-name="test"
#SBATCH -D .
#SBATCH --output=./logs_%j.out
#SBATCH --error=./logs_%j.err
#SBATCH --nodelist=nodes[19,23]
#SBATCH --time=120:30:00
#SBATCH --partition=AWESOME
#SBATCH --wait-all-nodes=1
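# Launch on the second allocated node (-r is a zero-based index into the allocation)
srun -lN1 -n1 -r 1 ./com_19.bash &
# Launch on the first allocated node
srun -lN1 -r 0 ./com_23.bash &
# Give the steps a moment to start, then list the job and its steps
sleep 1
squeue
squeue -s
# Keep the batch script alive until both background steps finish
wait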
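
For comparison, the MULTIPLE PROGRAM CONFIGURATION route mentioned in the question might look roughly like the sketch below. It assumes a configuration file named multi.conf (a made-up name) and reuses ./com_19.bash and ./com_23.bash from the script above. Each line of the file maps a task rank to an executable, and -l is the short form of srun's --label option, which prefixes every output line with the rank of the task that produced it. The guard around srun exists only so the sketch does not fail on a machine without Slurm; which task lands on which node depends on the task distribution, so treat the placement as something to verify on your cluster.

```shell
#!/bin/bash
# Hypothetical config file: one line per task rank, "<rank> <executable> [args]".
cat > multi.conf <<'EOF'
0 ./com_19.bash
1 ./com_23.bash
EOF

# Inside the two-node allocation: a single job step with two tasks,
# one per node, each running the program listed for its rank.
if command -v srun >/dev/null 2>&1; then
    srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --label --multi-prog multi.conf
else
    echo "srun not found; wrote multi.conf only" >&2
fi
```

Unlike the two backgrounded srun calls, this starts both programs as tasks of one job step, so no trailing wait is needed.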