I have an MPI program that equally partitions a huge sample space among its worker processes (MPI ranks) and does the work in parallel. I submit the job with the following script:

#!/bin/bash
#PBS -l nodes=NNODES
mpirun -np NPROC ./run >log
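To give an idea of what the program does, the partitioning is roughly like the sketch below (simplified C, not my actual code; NSAMPLES and do_sample() are placeholders). The point is that the number of slices is determined entirely by the -np value:

/* Simplified sketch of the partitioning; NSAMPLES and do_sample()
   are placeholders, not the real code. */
#include <mpi.h>

#define NSAMPLES 1000000L          /* placeholder for the real sample count */

static void do_sample(long i) { (void)i; /* placeholder for the real per-sample work */ }

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);   /* equals the -np value from mpirun */

    /* Each rank takes an equal, contiguous slice of the sample space. */
    long chunk = NSAMPLES / nproc;
    long lo = (long)rank * chunk;
    long hi = (rank == nproc - 1) ? NSAMPLES : lo + chunk;

    for (long i = lo; i < hi; ++i)
        do_sample(i);

    MPI_Finalize();
    return 0;
}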

According to my cluster's webpage, each node has 10 cores, so I naively assumed that for 100 worker processes (NPROC=100) I only need to request 10 nodes (NNODES=10). However, I found that the program's walltime kept decreasing as I increased NNODES.

My guess was that there is less resource contention when each node gets one worker process instead of several. If that were the whole story, though, the walltime should level off once NNODES reaches 100 (i.e. NNODES == NPROC), because at that point there is only one worker process per node, and adding nodes beyond the number of worker processes should not reduce the walltime any further.

But I was wrong again: increasing NNODES beyond 100 (NPROC) kept reducing the walltime, almost linearly. This really confuses me, because the program genuinely takes NPROC from the script above and partitions the samples among exactly that many processes, so I can't understand why requesting more nodes than worker processes makes things faster.


1 Answer

If your program does a lot of I/O, you might be observing a side effect: by reserving a large number of nodes you prevent other jobs from running, so the network, the filesystem, etc. are less stressed and your job effectively gets more of those shared resources. If your job is CPU-bound, however, this explanation does not apply.
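A quick way to check which case you are in (just a sketch; the two commented phases are placeholders for your actual reads and your actual computation) is to bracket the I/O and the compute with MPI_Wtime() and compare the per-rank totals:

/* Sketch: compare time spent in I/O vs. compute on each rank.
   The two commented phases are placeholders for the real code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    /* ... your file reads/writes go here ... */
    double t_io = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... your per-sample computation goes here ... */
    double t_cpu = MPI_Wtime() - t0;

    printf("rank %d: io %.3f s, compute %.3f s\n", rank, t_io, t_cpu);

    MPI_Finalize();
    return 0;
}

If the I/O time is a large share of the total and shrinks when you reserve more nodes, the shared-resource effect above is a plausible culprit; if the compute time dominates, look elsewhere.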