I can launch an MPI job across multiple compute nodes using a Slurm batch script and srun. As part of the Slurm script, I want to launch a shell script on each node the job is using to collect information (via the top command) about the job tasks running on that node. I want the shell script to run at the node level, rather than the task level. The shell script works fine on a single compute node, and for single-node jobs I can run it in the background from the Slurm script. But it's not clear how to get it to run on multiple compute nodes using srun. I've tried using multiple srun commands in the Slurm batch script, but the shell script only starts on one compute node.
1 Answer
I figured this out. I created a shell-script wrapper that invokes the MPI code, and in the Slurm batch script I call srun on that wrapper. Inside the wrapper I use the following conditional to start my monitoring script (sampleTop2.sh) exactly once on each allocated compute node.
if (( SLURM_PROCID % SLURM_NTASKS_PER_NODE == 0 ))
then
    ./sampleTop2.sh "$USER" "$SLURMD_NODENAME" 10 &
fi
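The arithmetic test can be exercised outside Slurm. Below is a minimal sketch that simulates four task ranks spread over two nodes (two tasks per node) and reports which ranks would launch the monitor; the helper name first_on_node is mine, not a Slurm utility, and the rank/count values are made-up stand-ins for SLURM_PROCID and SLURM_NTASKS_PER_NODE.

```shell
# first_on_node: succeeds when the given task rank is the first task on
# its node, i.e. when (rank mod tasks-per-node) == 0.
first_on_node() {
    # $1 = task rank (stand-in for SLURM_PROCID)
    # $2 = tasks per node (stand-in for SLURM_NTASKS_PER_NODE)
    (( $1 % $2 == 0 ))
}

# Simulate 4 ranks with 2 tasks per node: ranks 0 and 2 start the monitor.
ntasks_per_node=2
for procid in 0 1 2 3; do
    if first_on_node "$procid" "$ntasks_per_node"; then
        echo "rank $procid: starts sampleTop2.sh in the background"
    else
        echo "rank $procid: skips (monitor already running on this node)"
    fi
done
```

One caveat worth noting: SLURM_NTASKS_PER_NODE is only exported when the job explicitly requests --ntasks-per-node. A more general alternative is to test SLURM_LOCALID (the node-local task rank, which Slurm always sets for srun-launched tasks) against 0, which also selects exactly one task per node.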