I want to run an MPI job on my Kubernetes cluster. The context is that I'm running a modern, nicely containerised app, but part of the workload is a legacy MPI job which isn't going to be rewritten anytime soon, and I'd like to fit it into a Kubernetes "worldview" as much as possible.
One initial question: has anyone had any success running MPI jobs on a Kubernetes cluster? I've seen Christian Kniep's work on getting MPI jobs to run in Docker containers, but he's going down the Docker Swarm path (with peer discovery via Consul running in each container), and I want to stick with Kubernetes (which already knows the identities of all the peers) and inject that information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.
I have a few ideas about how to proceed:
1. Fat containers containing Slurm and the application code: populate slurm.conf with appropriate info about the peers at container startup, then use srun as the container entrypoint to start the jobs (see the first sketch after this list).
2. Slimmer containers with only OpenMPI (no Slurm): populate a rankfile in the container with info from outside (provided by Kubernetes), then use mpirun as the container entrypoint (second sketch below).
3. An even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) and run the mpicc'd binary directly, where it'll find out about its peers through the env vars (third sketch below).
4. Some other option.
5. Give up in despair.
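To make option 1 concrete, here's a minimal sketch of what the container entrypoint could look like. Everything here is an assumption for illustration: the `PEER_HOSTS` variable (a comma-separated hostname list injected from Kubernetes), the `./app` binary name, and the config values; it also glosses over starting slurmctld/slurmd in the containers, which is part of the "fat" cost. (`SlurmctldHost` is the key in recent Slurm versions; older ones use `ControlMachine`.)

```sh
#!/bin/sh
# Sketch of an option-1 entrypoint. PEER_HOSTS is a hypothetical
# comma-separated hostname list injected from Kubernetes.
: "${PEER_HOSTS:?need comma-separated peer hostnames}"
NPROCS=$(echo "$PEER_HOSTS" | tr ',' '\n' | wc -l)

# Generate a minimal slurm.conf listing the peers.
cat > /etc/slurm/slurm.conf <<EOF
ClusterName=mpi-job
SlurmctldHost=$(hostname)
NodeName=${PEER_HOSTS} State=UNKNOWN
PartitionName=mpi Nodes=${PEER_HOSTS} Default=YES State=UP
EOF

# slurmctld/slurmd would also have to be started here (omitted).
exec srun -N "$NPROCS" ./app
```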
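For option 2, a similar sketch, with the same caveats: `PEER_HOSTS` and `./app` are hypothetical, the hostnames must be resolvable (e.g. via a headless Service), and mpirun still needs a way to start its daemons on the peers (typically ssh), so the images would need sshd running as well.

```sh
#!/bin/sh
# Sketch of an option-2 entrypoint: build an OpenMPI rankfile from
# externally injected peer info, then launch via mpirun.
: "${PEER_HOSTS:?need comma-separated peer hostnames}"

# One rank per host; rankfile lines look like "rank N=host slot=S".
i=0
for h in $(echo "$PEER_HOSTS" | tr ',' ' '); do
  echo "rank $i=$h slot=0" >> /tmp/rankfile
  i=$((i + 1))
done

exec mpirun -np "$i" --rankfile /tmp/rankfile ./app
```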
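And for option 3, purely as illustration of what I mean by "faking" the runtime: these are documented variables that mpirun normally exports to each rank, with hypothetical `K8S_SIZE`/`K8S_RANK` values injected per pod from Kubernetes.

```sh
#!/bin/sh
# Option 3 sketch. K8S_SIZE/K8S_RANK are hypothetical per-pod values
# injected from Kubernetes; one rank per pod is assumed.
export OMPI_COMM_WORLD_SIZE="${K8S_SIZE}"
export OMPI_COMM_WORLD_RANK="${K8S_RANK}"
export OMPI_COMM_WORLD_LOCAL_RANK=0
export OMPI_COMM_WORLD_LOCAL_SIZE=1

exec ./app
```

I suspect this alone won't work, since (as the comment below points out) MPI_Init in OpenMPI expects to exchange wire-up information via the ORTE runtime rather than just read these variables, but it shows the shape of the idea.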
I know trying to mix "established" workflows like MPI with the "new hotness" of Kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists, I'm happy to hack on some stuff and push it back upstream.
orterun (a.k.a. mpirun and mpiexec) does much more than simply launch the executable multiple times. It serves as a central information broker between the ranks. Option 2 seems most reasonable. – Hristo Iliev