3
votes

I am trying to run some code across multiple CPUs using MPI.

I run using:

$ mpirun -np 24 python mycode.py

I'm running on a cluster with 8 nodes, each with 12 CPUs. My 24 processes get scattered across all nodes.

Let's call the nodes node1, node2, ..., node8 and assume that the master process is on node1 and my job is the only one running. So node1 has the master process and a few slave processes, the rest of the nodes have only slave processes.

Only the node with the master process (i.e. node1) is being used. I can tell because nodes 2-8 have load ~0 while node1 has load ~24 (whereas I would expect the load on each node to be approximately equal to the number of CPUs allocated to my job on that node). Also, each time a function is evaluated, I have it print out the name of the host on which it's running, and it prints "node1" every time. I don't know whether the master process is the only one doing anything or if the slave processes on the same node as the master are also being used.

The cluster I'm running on was recently upgraded. Before the upgrade, I was using the same code and it behaved entirely as expected (i.e. when I asked for 24 CPUs, it gave me 24 CPUs and then used all 24 CPUs). This problem has only arisen since the upgrade, so I assume a setting somewhere got changed or reset. Has anyone seen this problem before and know how I might fix it?

Edit: This is submitted as a job to a scheduler using:

#!/bin/bash
#
#$ -cwd
#$ -pe * 24
#$ -o $JOB_ID.out
#$ -e $JOB_ID.err
#$ -r no
#$ -m n
#$ -l h_rt=24:00:00

echo job_id $JOB_ID
echo hostname $HOSTNAME

mpirun -np $NSLOTS python mycode.py

The cluster is running SGE and I submit this job using:

qsub myjob
It's possible that after the upgrade, the MPI implementations weren't rebuilt with SGE support, in which case you'd have to explicitly tell mpirun where to find the list of hosts to run on; depending on your system configuration, something like mpirun -machinefile $PE_HOSTFILE -np $NSLOTS python mycode.py should work. But this is all stuff that your sysadmins should set up/document for you. You should be able to run mpirun hostname (rather than python mycode.py) as a quick test that you are getting the hosts you expect. – Jonathan Dursi
The sysadmins say MPI is installed with SGE support (they even reinstalled it for me to be sure), but no joy. Setting up a hostfile has worked though, so that'll do. And that tip about using mpirun hostname was very, very helpful for testing. I was printing out the hostname as a test, but in a more complicated way - your way was much quicker! – Laura
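
A minimal sketch of the test and workaround suggested in the comment above, assuming an Open MPI-style mpirun running inside the SGE job (the -machinefile flag and the $PE_HOSTFILE format may differ for your MPI build):

# Quick test: launch `hostname` instead of the Python script and check
# which nodes the 24 ranks actually land on.
mpirun -np $NSLOTS hostname

# Workaround if the MPI build lost its SGE integration: point mpirun at the
# hostfile SGE writes for this job. $PE_HOSTFILE is set by SGE inside the job;
# whether mpirun can read its format directly depends on your MPI build.
mpirun -machinefile $PE_HOSTFILE -np $NSLOTS python mycode.py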

2 Answers

2
votes

It's also possible to specify where you want your jobs to run by using a hostfile. How the hostfile is formatted and used varies by MPI implementation so you'll need to consult the documentation for the one you have installed (man mpiexec) to find out how to use it.

The basic idea is that inside that file, you can define the nodes that you want to use and how many ranks you want on each of them. This may require using other flags to specify how the processes are mapped to your nodes, but in the end, you can usually control how everything is laid out yourself.
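
For example, with Open MPI a hostfile looks something like the sketch below (the filename is hypothetical, and MPICH uses a different format such as node1:12, so check man mpiexec for your installation):

# hosts.txt -- hypothetical hostfile in Open MPI's "slots=" format
node1 slots=12
node2 slots=12

# launch 24 ranks spread over the two nodes listed above
mpirun -hostfile hosts.txt -np 24 python mycode.py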

All of this is different if you're using a scheduler like PBS, TORQUE, LoadLeveler, etc., as those can sometimes do some of this for you or have their own ways of mapping jobs. You'll have to consult the documentation for those separately or ask another question about them with the appropriate tags here.

1
votes

Clusters usually have a batch scheduler like PBS, TORQUE, LoadLeveler, etc. These are generally given a shell script that contains your mpirun command along with environment variables that the scheduler needs. You should ask the administrator of your cluster what the process is for submitting batch MPI jobs.
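
For instance, a submission script for TORQUE/PBS looks much like the SGE script in the question, just with different directives. This is a hypothetical sketch; the resource names and limits are site-specific, so ask your administrator for your cluster's template:

#!/bin/bash
#PBS -l nodes=2:ppn=12        # request 2 nodes with 12 cores each (syntax varies by site)
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR

# With a scheduler-aware MPI build, mpirun picks up the allocated hosts
# automatically; otherwise point it at the node list the scheduler provides.
mpirun -machinefile $PBS_NODEFILE -np 24 python mycode.py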