I am trying to run an MPI job on a cluster under PBS resource management. The cluster guide says I shouldn't have to worry about passing anything to mpiexec, as PBS should take care of that. For jobs on a single node, this is true and the job runs perfectly.
When I submit jobs requiring more than one node, however, the job exits complaining that it can't recognise the hosts. I therefore added a routine to my PBS script that parses $PBS_NODEFILE and reconstructs a hosts file with the proper DNS suffix, and the hosts are now recognised.
Now comes the troubling part: the hosts file I generate isn't being honoured by mpiexec. Below are the hosts file I pass and the output from the MPI process.
My hosts file:
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-2-2.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
cx1-25-3-1.cx1.hpc.ic.ac.uk
Output from the MPI process:
Host : "cx1-25-2-2.cx1.hpc.ic.ac.uk"
PID : 32752
nProcs : 24
Slaves :
23
(
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32753"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32754"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32755"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32756"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32757"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32758"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32759"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32760"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32761"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32762"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32763"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32764"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32765"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32766"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.32767"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.316"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.319"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.320"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.321"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.322"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.323"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.324"
"cx1-25-2-2.cx1.hpc.ic.ac.uk.325"
)
Should the list of processes be identical to the hosts file? Why doesn't mpiexec accept the hosts file?
The implementation is OpenMPI 1.6.0, and an MWE of my PBS script follows:
#!/bin/sh
#PBS -l walltime=40:00:00
#PBS -l select=2:ncpus=12:mpiprocs=24:mem=4gb
module load openfoam/2.3.0 libscotch
# stage input onto each node's local scratch (pbsdsh2 is a site-provided helper)
pbsdsh2 cp -r $INPUT_DIR $TMPDIR/ 2>&1
# setting up hosts file: $PBS_NODEFILE lists each node once per MPI rank,
# so take every 24th line to get one line per node, then append the DNS suffix
sed -n 1~24p $PBS_NODEFILE > hosts_buffer
for ii in `cat hosts_buffer`; do echo ${ii}.cx1.hpc.ic.ac.uk slots=12; done > hosts
nprocs=24
# execution
mpiexec --hostfile hosts -np $nprocs $SOLVER 2>&1
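For comparison, here is a minimal sketch of the same script relying on tm instead of a hand-built hosts file. This assumes the cluster's Open MPI module is built with tm support (see the comments below), in which case mpiexec obtains the host list from PBS itself and needs neither --hostfile nor -np. Note that I've also set mpiprocs=12 to match ncpus, which is an assumption on my part, not something from the cluster guide:
#!/bin/sh
#PBS -l walltime=40:00:00
#PBS -l select=2:ncpus=12:mpiprocs=12:mem=4gb
module load openfoam/2.3.0 libscotch
pbsdsh2 cp -r $INPUT_DIR $TMPDIR/ 2>&1
# with tm support, mpiexec queries PBS for the allocation directly,
# launching one rank per allocated slot (2 nodes x 12 = 24 ranks here)
mpiexec $SOLVER 2>&1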
Comments:
"Open MPI has to be compiled with support for tm (the PBS API), and when it is, it can use the tm interface to both obtain information about the host list and to launch processes on remote nodes. To check if your MPI is Open MPI, just issue mpicc --showme:version and it should print the version of Open MPI." - Hristo Iliev
"I can't run mpicc --showme:version because it won't run if not in a PBS environment and the queues are very long, but I loaded the module and saw $MPI_ARCH_PATH and $MPI_LIBS pointing to the location for OpenMPI 1.6.0. As for the compile flag, a quick ompi_info | grep tm showed:
MCA ras: tm (MCA v2.0, API v2.0, Component v1.6)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.6)
MCA ess: tm (MCA v2.0, API v2.0, Component v1.6)" - linuxpirates
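If it is unclear whether the tm components are actually being selected at run time, one diagnostic sketch (my suggestion, not from the cluster guide) is to raise the verbosity of Open MPI's RAS and PLM frameworks inside the job script and look for tm in the output:
# print which allocation (ras) and launch (plm) components are selected
mpiexec --mca ras_base_verbose 5 --mca plm_base_verbose 5 -np $nprocs $SOLVER 2>&1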