I'm submitting a job with qsub myjob.pbs. The script contains the logic to run my experiments, including a call to torchrun, PyTorch's distributed launch utility. With torchrun you can set the number of nodes and the number of processes (and thus GPUs) per node. Depending on availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set from the same command-line argument.

I tried the following:

#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"

torchrun --standalone --nnodes=1 --nproc_per_node=$1  myscript.py

and invoked it like so:

qsub --pass "4" myjob.pbs

but I got the following error:

ERROR: -l: gpus: expected valid integer, found '"$1"'

So the $1 in the #PBS line is apparently read literally rather than expanded. Is there a way to pass the number of GPUs to the script so that the PBS directives can read it?
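
One direction that might work, sketched here under assumptions rather than tested on a cluster: #PBS lines are comments to the shell and are parsed by qsub itself, so they never go through variable expansion. Resource requests given on the qsub command line take precedence over the #PBS directives, and -v exports a variable into the job environment. A small wrapper could therefore build the qsub call from a single argument. The function name build_qsub_cmd and the variable name NGPUS are made up for illustration; the resource string is the Torque-style one from the script above.

```shell
#!/bin/sh
# Hypothetical helper: print the qsub command for a given GPU count.
# Assumes a Torque-style qsub where -l on the command line overrides
# the "#PBS -l" directive and -v exports NGPUS into the job environment.
build_qsub_cmd() {
    ngpus=$1
    # Reject empty or non-numeric input before it reaches qsub.
    case "$ngpus" in
        ''|*[!0-9]*)
            echo "usage: build_qsub_cmd <num_gpus>" >&2
            return 1
            ;;
    esac
    # The same number feeds both the resource request and the job environment.
    echo "qsub -l nodes=1:ppn=12:gpus=${ngpus} -v NGPUS=${ngpus} myjob.pbs"
}
```

Running eval "$(build_qsub_cmd 4)" would submit the job. Inside myjob.pbs, the torchrun line would then read --nproc_per_node="$NGPUS" instead of $1, and the gpus= part of the #PBS -l line would hold a fixed default such as 1.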