Using mpi4py, I'm running a Python program which launches multiple Fortran processes in parallel, started from a SLURM script using (for example):
mpirun -n 4 python myprog.py
However, I've noticed that myprog.py takes longer to run the higher the number of tasks requested. The following shows only the MPI part of the program:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = None
if rank == 0:
    data = params  # shape (size, 4), dtype float64
recvbuf = np.empty(4, dtype=np.float64)
comm.Scatter(data, recvbuf, root=0)
py_task(int(recvbuf[0]), recvbuf[1], recvbuf[2], int(recvbuf[3]))
With mpirun -n 1 ..., a single recvbuf array takes about 3 minutes, whereas four recvbuf arrays on four processors with mpirun -n 4 ... take about 5 minutes, even though they should be running in parallel. I would expect the run times to be approximately equal in both the single and four processor cases.
py_task is effectively a Python wrapper that launches a Fortran program using:
subprocess.check_call(cmd)
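For completeness, py_task looks roughly like this; the executable path, argument order, and helper name are hypothetical, but the subprocess.check_call pattern matches my actual code:

```python
import subprocess

# Hypothetical path to the compiled Fortran executable.
FORTRAN_EXE = "./fortran_prog"

def build_cmd(n, a, b, m):
    # Assemble the argv list; the Fortran program is assumed to read
    # its four inputs from the command line in this order.
    return [FORTRAN_EXE, str(n), str(a), str(b), str(m)]

def py_task(n, a, b, m):
    # check_call blocks until the child process exits and raises
    # CalledProcessError on a nonzero exit status.
    subprocess.check_call(build_cmd(n, a, b, m))
```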
There seems to be some interaction between subprocess.check_call(cmd) and mpi4py that is preventing the code from running properly in parallel.
I've searched for this issue but haven't found anything that helps. Is there a fix for this, a detailed explanation of what's going on, or a recommended way to isolate the cause of the bottleneck in this code?
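One thing I plan to try is timing the external call on each rank separately, to see whether the extra minutes are spent inside the Fortran process itself or in the MPI layer around it. A minimal sketch (the helper name is mine):

```python
import subprocess
import time

def timed_check_call(cmd):
    # Time only the external process, so its runtime can be separated
    # from any MPI communication or startup overhead around it.
    t0 = time.perf_counter()
    subprocess.check_call(cmd)
    return time.perf_counter() - t0

# Inside myprog.py, each rank would report its own timing, e.g.:
# elapsed = timed_check_call(cmd)
# print(f"rank {rank}: fortran took {elapsed:.1f}s", flush=True)
```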
Additional note:
This pipeline was adapted to mpi4py from joblib's Parallel, where there were no issues with subprocess.check_call() running in parallel, which is why I suspect the problem lies in the interaction between subprocess and mpi4py.