I'm invoking a job with qsub myjob.pbs. In there, I have some logic to run my experiments, which includes running torchrun, a distributed launch utility for PyTorch. In that command you can set the number of nodes and the number of processes (and therefore GPUs) per node. Depending on availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set from a command-line argument.
I tried the following:
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
torchrun --standalone --nnodes=1 --nproc_per_node=$1 myscript.py
and invoked it like so:
qsub --pass "4" myjob.pbs
but I got the following error:
ERROR: -l: gpus: expected valid integer, found '"$1"'
Is there a way to pass the number of GPUs to the script so that the PBS directives can read them?
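One possible workaround, sketched here under assumptions rather than verified on this cluster: #PBS directive lines are read by qsub as plain comments, so shell variables such as $1 are never expanded in them. On Torque-style schedulers, a -l given on the qsub command line takes precedence over the directive in the script, and -v exports a variable into the job's environment. The wrapper below (the names build_qsub_cmd and NGPUS are hypothetical) only builds and prints the command, so it can be checked before anything is submitted.

```shell
#!/bin/sh
# Hypothetical helper: build the qsub command line for a given GPU count.
# Assumes a Torque-style qsub where a command-line -l overrides "#PBS -l"
# and -v exports variables into the job environment.
build_qsub_cmd() {
    ngpus=$1
    # Reject anything that is not a positive integer before it reaches the scheduler.
    case $ngpus in
        ''|*[!0-9]*|0)
            echo "usage: build_qsub_cmd <ngpus>" >&2
            return 1
            ;;
    esac
    # NGPUS is an illustrative variable name, not something PBS defines.
    echo "qsub -l nodes=1:ppn=12:gpus=$ngpus -v NGPUS=$ngpus myjob.pbs"
}
```

With this approach, myjob.pbs would read the exported variable instead of a positional argument, for example torchrun --standalone --nnodes=1 --nproc_per_node="$NGPUS" myscript.py, and the printed command could be run by hand or via eval "$(build_qsub_cmd 4)".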