
Let's say I want to submit a Slurm job specifying only the total number of tasks (--ntasks=someNumber), without specifying the number of nodes or the number of tasks per node. Is there a way to know, within the launched Slurm script, how many cores Slurm has assigned on each of the reserved nodes? I need this information to build the machinefile for the program I'm launching, which must be structured like this:

node02:7
node06:14
node09:3

Once the job is launched, the only way I have found to see which cores have been allocated on each node is the command:

scontrol show jobid -dd $SLURM_JOB_ID

The above information is buried in its output, together with plenty of other details. Is there a better way to get it?
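
For reference, this is roughly how I dig it out at the moment (assuming the detailed output contains per-node lines of the form Nodes=node02 CPU_IDs=0-6 ...):

scontrol show jobid -dd $SLURM_JOB_ID | grep 'CPU_IDs='

but I then still have to convert the CPU ID ranges into per-node counts by hand.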

Thanks in advance, Lorenzo


1 Answer


The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output format you want, you could run

srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
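
For example, inside a hypothetical batch script (the task count, machine file name and program invocation below are placeholders), this could look like:

#!/bin/bash
#SBATCH --ntasks=24
# Build the machine file from the actual allocation: srun starts one task
# per allocated task slot, so counting hostnames gives the per-node counts.
MACHINEFILE=machinefile.$SLURM_JOB_ID
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
# myprogram -machinefile $MACHINEFILE ...   (program name and option are placeholders)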

You should check the documentation of your program to see whether it accepts a machine file with repeated hostnames rather than a :count suffix. If so, you can simplify the command to

srun hostname -s > $MACHINEFILE

And of course, the first step is to make sure you actually need a machine file in the first place: many parallel programs and libraries have native Slurm support and can gather the needed information from the environment variables set up by Slurm when the job starts.
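
For instance, variables such as SLURM_JOB_NODELIST, SLURM_NTASKS, SLURM_TASKS_PER_NODE and SLURM_JOB_CPUS_PER_NODE are documented in the sbatch man page; a quick way to see what your program could read is to print them from the job script:

echo "Nodes:          $SLURM_JOB_NODELIST"
echo "Total tasks:    $SLURM_NTASKS"
echo "Tasks per node: $SLURM_TASKS_PER_NODE"
echo "CPUs per node:  $SLURM_JOB_CPUS_PER_NODE"
# scontrol show hostnames $SLURM_JOB_NODELIST   # expands the compact node list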