
According to the SLURM FAQ:

Can Slurm emulate a larger cluster? Yes, this can be useful for testing purposes. It has also been used to partition "fat" nodes into multiple Slurm nodes. There are two ways to do this. The best method for most conditions is to run one slurmd daemon per emulated node in the cluster as follows.
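For reference, the FAQ's one-slurmd-per-emulated-node method is driven by slurm.conf entries along these lines (a sketch: hostnames and ports are illustrative, and this mode requires Slurm built with the --enable-multiple-slurmd configure option):

```conf
# slurm.conf (fragment) -- 10 emulated nodes on one physical host
NodeName=compute-[0-9] NodeHostname=physical-host Port=[17001-17010] CPUs=4 Gres=gpu:1

# Then start one daemon per emulated node, e.g.:
#   slurmd -N compute-0
#   slurmd -N compute-1
#   ...
```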

Assume we have a single node with 10 GPUs and 40 CPU cores. Can this node be virtually split into 10 nodes with 4 cores and 1 GPU each, with explicit CPU/GPU binding? If so, what does the configuration need to look like?


1 Answer


You could create 10 virtual machines with the specs you need (4 cores and 1 GPU each), all connected to the same network. Then launch a slurmd daemon in each VM (and run slurmctld in one of them).


You should bind physical cores to the VMs to get more accurate behavior. But if this is just for testing purposes, maybe that's not a big deal.
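With libvirt/KVM, that pinning can be expressed in each VM's domain XML; a sketch for one VM, where the physical core numbers are illustrative and would differ per VM:

```xml
<!-- libvirt domain XML fragment: pin this VM's 4 vCPUs
     to physical cores 0-3 (use cores 4-7 for the next VM, etc.) -->
<vcpu placement='static'>4</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
</cputune>
```

GPU binding would be handled separately by passing one GPU through to each VM (e.g. VFIO/PCI passthrough).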

I think that this approach is quite straightforward for what you want. Furthermore, this approach allows you to configure Slurm as usual.

The configuration would be:

NodeName=compute-[0-9] CPUs=4 Gres=gpu:1
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP
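For the GPUs to be usable as a GRES, slurm.conf also needs GresTypes set, and each VM needs a gres.conf telling its slurmd where the device lives. A minimal sketch (the device path is an assumption; adjust to the actual GPU device in each VM):

```conf
# slurm.conf addition
GresTypes=gpu

# gres.conf in each VM (device path illustrative)
Name=gpu File=/dev/nvidia0
```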

Hope that helps!