Let's say I have started an MPI job with 256 cores on 16 nodes.
I have an MPI program, but unfortunately it is not parallelized over one particular parameter. Fortunately, I can easily write my own MPI program to handle the parallelization over that parameter, provided I can get at the output files.
So, how can I start an MPI job from within an MPI job, using only a particular subset of these cores, namely a single node? Basically, I want to run 16 distinct MPI calculations with 16 cores each from within a single 256-core MPI job. Each calculation takes about 10 minutes on 16 cores, and the outer loop has about 200 iterations. With all 16 calculations running at once on the 256 cores, that comes to a reasonable 33 hours or so (200 × 10 minutes). It is not feasible either to resubmit the job 200 times or to run these 16 calculations sequentially.
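The back-of-the-envelope timing above, using only the figures from the question (about 200 iterations, about 10 minutes per 16-core run, 16 calculations per iteration), works out like this. It assumes every calculation in an iteration takes the same time and ignores queueing and startup overhead.

```python
ITERATIONS = 200       # outer-loop iterations ("about 200")
MINUTES_PER_RUN = 10   # one 16-core NWChem calculation
N_IMAGES = 16          # independent calculations per iteration, one per node


def wall_hours(parallel_images):
    """Wall-clock hours when `parallel_images` calculations run at the same time."""
    rounds_per_iteration = -(-N_IMAGES // parallel_images)  # ceiling division
    return ITERATIONS * rounds_per_iteration * MINUTES_PER_RUN / 60


print("all 16 images at once: %.1f h" % wall_hours(16))  # about 33 h
print("one image at a time:   %.1f h" % wall_hours(1))   # about 533 h
```

Running the images one after another multiplies the wall time by 16, which is why the sequential route is out.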
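To be even more precise, here is some Python pseudo-code for what I want to do:
```python
import os

from ase.parallel import world

rank = world.rank
mydir = "/path/to/run"  # placeholder: directory holding Calculation_0 ... Calculation_15

while True:  # repeat until the band has converged
    node = rank // 16      # which 16-core group (node) this rank belongs to
    subrank = rank % 16    # position of this rank within its node
    os.chdir(os.path.join(mydir, "Calculation_%d" % node))
    # This will not work: one needs to specify somehow that only
    # ranks node*16 to node*16+15 are used by this mpirun
    os.system("mpirun -n 16 nwchem input.nw > nwchem.out")
    # analyse_output is my own post-processing routine
    analyse_output(os.path.join(mydir, "Calculation_%d" % node, "nwchem.out"))
```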
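The node/subrank arithmetic in the pseudo-code can be checked on its own, without MPI. This is a minimal sketch that assumes ranks are numbered node by node (consecutive ranks share a node), which depends on how the launcher maps ranks; `rank_plan` and the `/path/to/run` base directory are hypothetical names for illustration.

```python
import os


def rank_plan(n_ranks, n_jobs, basedir="/path/to/run"):
    """Map each world rank to (rank, working directory, rank within job, job size).

    Assumes n_ranks is a multiple of n_jobs and that ranks are laid out
    node by node, so ranks job*per_job .. job*per_job + per_job - 1 share a node.
    """
    if n_ranks % n_jobs != 0:
        raise ValueError("n_ranks must be a multiple of n_jobs")
    per_job = n_ranks // n_jobs
    plan = []
    for rank in range(n_ranks):
        job, subrank = divmod(rank, per_job)  # same as rank // per_job, rank % per_job
        workdir = os.path.join(basedir, "Calculation_%d" % job)
        plan.append((rank, workdir, subrank, per_job))
    return plan


# Reproduce the small 16-core / 4-job example.
for rank, workdir, subrank, size in rank_plan(16, 4):
    print("rank %d: start nwchem in %s as rank %d/%d" % (rank, workdir, subrank, size))
```

For the real job, `rank_plan(256, 16)` gives each of the 16 nodes its own directory and local ranks 0 to 15.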
Basically, with 16 cores and 4 jobs:
rank 0: start nwchem in /calculation0/ as rank 0/4.
rank 1: start nwchem in /calculation0/ as rank 1/4.
rank 2: start nwchem in /calculation0/ as rank 2/4.
rank 3: start nwchem in /calculation0/ as rank 3/4.
rank 4: start nwchem in /calculation1/ as rank 0/4.
rank 5: start nwchem in /calculation1/ as rank 1/4.
rank 6: start nwchem in /calculation1/ as rank 2/4.
rank 7: start nwchem in /calculation1/ as rank 3/4.
rank 8: start nwchem in /calculation2/ as rank 0/4.
rank 9: start nwchem in /calculation2/ as rank 1/4.
rank 10: start nwchem in /calculation2/ as rank 2/4.
rank 11: start nwchem in /calculation2/ as rank 3/4.
rank 12: start nwchem in /calculation3/ as rank 0/4.
rank 13: start nwchem in /calculation3/ as rank 1/4.
rank 14: start nwchem in /calculation3/ as rank 2/4.
rank 15: start nwchem in /calculation3/ as rank 3/4.
Gather all the results.
Optimize all geometries (this requires knowledge of forces between the calculations).
Repeat until convergence (about 200 times).
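The "forces between the calculations" in the optimization step are the spring forces that couple neighbouring images in a nudged elastic band. As a minimal sketch, assuming each NWChem run yields a force vector (minus the energy gradient) as a flat list of floats, the standard NEB projection keeps only the perpendicular part of the true force and only the parallel part of the spring force. It uses the simple central-difference tangent, a simplification (production NEB codes often use an improved tangent estimate), and the spring constant `k` is an arbitrary placeholder.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def neb_forces(images, true_forces, k=0.1):
    """Return NEB forces for every image in the chain; the endpoints stay fixed.

    images: list of flattened coordinate lists, endpoints included.
    true_forces: matching list of forces (-dE/dR), one per calculation.
    """
    n = len(images)
    result = [[0.0] * len(images[0]) for _ in range(n)]
    for i in range(1, n - 1):
        # Central-difference tangent along the band at image i.
        tangent = _sub(images[i + 1], images[i - 1])
        length = _norm(tangent)
        tau = [t / length for t in tangent]
        # Spring force acts only along the band, pushing the two segments to equal length.
        stretch = _norm(_sub(images[i + 1], images[i])) - _norm(_sub(images[i], images[i - 1]))
        # True force acts only perpendicular to the band.
        f = true_forces[i]
        f_par = _dot(f, tau)
        result[i] = [fj - f_par * tj + k * stretch * tj for fj, tj in zip(f, tau)]
    return result
```

This is why the images cannot be optimized independently: each image's step needs the geometries of its neighbours, so all results must be gathered every iteration.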
Background: in case you are interested, I elaborate on the details here. But the main question is still: "How do I instantiate N MPI calculations, each with M / N cores, from within a single MPI calculation of M cores?"
NWChem does not have an image-parallel nudged elastic band (NEB) calculator. Here is an example of this process with a different code, GPAW: https://wiki.fysik.dtu.dk/gpaw/tutorials/neb/neb.html
There it is straightforward, because creating a sub-communicator with GPAW and its MPI interface is easy. However, NWChem is only available to me as a runtime MPI program, and I want to do the same thing: create many calculators (a band, or chain, of geometries all linked by "springs") and optimize that chain.
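For comparison, the sub-communicator approach mentioned for GPAW corresponds to `MPI_Comm_split`: every rank calls it with a "color" (its image) and a "key" (its order inside the image). A minimal sketch with mpi4py, assuming mpi4py is installed and ranks are laid out node by node; `split_args` and `make_image_communicator` are hypothetical helper names. Note that this only helps code that can be handed a communicator; an external binary started through `mpirun` creates its own `MPI_COMM_WORLD` and never sees it.

```python
RANKS_PER_IMAGE = 16  # one image per 16-core node, as in the question


def split_args(world_rank, ranks_per_image=RANKS_PER_IMAGE):
    """Return (color, key) for MPI_Comm_split: color = image index, key = rank inside it."""
    return divmod(world_rank, ranks_per_image)


def make_image_communicator(ranks_per_image=RANKS_PER_IMAGE):
    """Split MPI_COMM_WORLD into one sub-communicator per image.

    mpi4py is imported lazily so split_args can be used without an MPI library.
    """
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    if world.Get_size() % ranks_per_image != 0:
        raise ValueError("world size must be a multiple of ranks_per_image")
    color, key = split_args(world.Get_rank(), ranks_per_image)
    image_comm = world.Split(color=color, key=key)
    return color, image_comm
```

Under `mpirun -n 256`, `image, comm = make_image_communicator()` gives each node a communicator whose `comm.Get_rank()` runs from 0 to 15, which is exactly the grouping the pseudo-code needs.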