20 votes

I want to run an MPI job on my Kubernetes cluster. The context is that I'm actually running a modern, nicely containerised app, but part of the workload is a legacy MPI job which isn't going to be rewritten anytime soon, and I'd like to fit it into a Kubernetes "worldview" as much as possible.

One initial question: has anyone had any success in running MPI jobs on a Kubernetes cluster? I've seen Christian Kniep's work on getting MPI jobs to run in Docker containers, but he's going down the Docker Swarm path (with peer discovery using Consul running in each container), whereas I want to stick to Kubernetes (which already knows about all the peers) and inject this information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.

I have a couple of ideas about how to proceed:

  1. fat containers containing slurm and the application code -> populate the slurm.conf with appropriate info about the peers at container startup -> use srun as the container entrypoint to start the jobs

  2. slimmer containers with only OpenMPI (no slurm) -> populate a rankfile in the container with info from outside (provided by Kubernetes) -> use mpirun as the container entrypoint (see the rough sketch after this list)

  3. an even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) -> run the mpicc'd binary directly (where it'll find out about its peers through the env vars)

  4. some other option

  5. give up in despair
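
To make option 2 concrete, here's a rough sketch (Python, purely for illustration) of what I mean. Everything specific in it is made up: the MPI_PEERS environment variable, the placeholder IPs and the rankfile path are stand-ins for whatever Kubernetes would actually inject (e.g. via a ConfigMap or the downward API):

    # Rough sketch of option 2: peer addresses are injected from outside the
    # container (here via a made-up MPI_PEERS env var) and an Open MPI rankfile
    # is written out before mpirun is invoked as the entrypoint.
    import os

    # Placeholder addresses; in reality these would be provided by Kubernetes.
    peers = os.environ.get("MPI_PEERS", "10.1.0.5,10.1.0.6").split(",")

    with open("/tmp/rankfile", "w") as f:
        for rank, host in enumerate(peers):
            f.write(f"rank {rank}={host} slot=0\n")  # Open MPI rankfile syntax

    # The container entrypoint would then run something like:
    #   mpirun --rankfile /tmp/rankfile ./my-mpi-binary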

I know trying to mix "established" workflows like MPI with the "new hotness" of Kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists, I'm happy to hack on some stuff and push it back upstream.

2
I doubt option 3 would work. Open MPI's orterun (a.k.a. mpirun and mpiexec) does much more than simply launch the executable multiple times; it serves as a central information broker between the ranks. Option 2 seems the most reasonable. – Hristo Iliev

2 Answers

2 votes

I tried running MPI jobs on Kubernetes for a few days and solved it by using dnsPolicy: None and dnsConfig (the CustomDNS=true feature gate is needed).
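
For illustration, this is roughly the relevant part of the pod spec, expressed with the Python Kubernetes client rather than YAML; the nameserver IP, search domain, image and names below are placeholders, not the actual values from my chart:

    # Rough sketch only: create a pod with dnsPolicy: None and a custom dnsConfig.
    # The nameserver IP, search domain and image are placeholders.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="mpi-worker-0"),
        spec=client.V1PodSpec(
            containers=[client.V1Container(name="mpi",
                                           image="example/openmpi:latest",
                                           command=["sleep", "infinity"])],
            dns_policy="None",                 # opt out of the default cluster DNS policy
            dns_config=client.V1PodDNSConfig(  # supply our own resolver settings instead
                nameservers=["10.96.0.10"],    # placeholder: cluster DNS service IP
                searches=["mpi-workers.default.svc.cluster.local"],
            ),
        ),
    )
    core.create_namespaced_pod(namespace="default", body=pod)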

I pushed my manifests (as a Helm chart) here:

https://github.com/everpeace/kube-openmpi

I hope it helps.

1 vote

Assuming you don't want to use a hardware-specific MPI library (for example, anything that needs direct access to the communication fabric), I would go with option 2.

  1. First, implement a wrapper for mpirun that populates the necessary data using the Kubernetes API, specifically the Endpoints of a Service (using a Service is probably a good idea); it could also scrape the pods' exposed ports directly. (See the sketch after this list.)

  2. Add some form of checkpoint program that can be used for "rendezvous" synchronization before starting the actual code (I don't know how well MPI deals with ephemeral nodes). This is to ensure that when mpirun starts, it has a stable set of pods to use.

  3. Finally, build a container with the necessary code and, presumably, an SSH service for mpirun to use to start processes in the other pods.
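
A rough sketch of steps 1 and 2 in Python, using the official Kubernetes client: the service name, environment variables and hostfile path are assumptions, and the "rendezvous" is just a naive poll of the Endpoints object until the expected number of workers is ready:

    # Hypothetical mpirun wrapper: discover peers via the Kubernetes API, wait
    # until the expected number of endpoints is ready, write a hostfile, then
    # hand over to mpirun.
    import os
    import sys
    import time

    from kubernetes import client, config


    def ready_ips(api, namespace, service):
        """Return the IPs of ready endpoints backing the given (headless) service."""
        endpoints = api.read_namespaced_endpoints(service, namespace)
        ips = []
        for subset in endpoints.subsets or []:
            for address in subset.addresses or []:
                ips.append(address.ip)
        return ips


    def main():
        config.load_incluster_config()  # we are running inside a pod
        api = client.CoreV1Api()
        namespace = os.environ.get("POD_NAMESPACE", "default")  # e.g. via downward API
        service = os.environ.get("MPI_SERVICE", "mpi-workers")  # assumed service name
        expected = int(os.environ.get("MPI_WORLD_SIZE", "4"))   # assumed world size

        # Naive rendezvous: poll until all expected workers are ready
        # (a real implementation would also want a timeout here).
        while True:
            ips = ready_ips(api, namespace, service)
            if len(ips) >= expected:
                break
            time.sleep(2)

        with open("/tmp/hostfile", "w") as f:
            for ip in sorted(ips):
                f.write(f"{ip} slots=1\n")  # Open MPI hostfile syntax

        # Step 3 is assumed here: SSH (or some other launcher) must already be
        # set up between the pods for mpirun to start remote processes.
        os.execvp("mpirun", ["mpirun", "--hostfile", "/tmp/hostfile"] + sys.argv[1:])


    if __name__ == "__main__":
        main()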


Another interesting option would be to use StatefulSets, possibly even running SLURM inside, to implement a "virtual" cluster of MPI machines running on Kubernetes.

This provides stable hostnames for each node, which would reduce the problem of discovery and keeping track of state. You could also use statefully assigned storage for each container's local working filesystem (which, with some work, could be made to always refer to, say, the same local SSD).
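
For example, if a (hypothetical) StatefulSet named mpi-worker is governed by a headless Service named mpi-workers in the default namespace, pod N is always reachable as mpi-worker-N.mpi-workers.default.svc.cluster.local, so a hostfile can be generated without querying the API at all. A minimal sketch, assuming four replicas:

    # Stable StatefulSet pod DNS names follow the pattern
    # <pod-name>.<service-name>.<namespace>.svc.cluster.local
    # (the names and replica count below are assumptions for illustration).
    REPLICAS = 4

    with open("/tmp/hostfile", "w") as f:
        for i in range(REPLICAS):
            f.write(f"mpi-worker-{i}.mpi-workers.default.svc.cluster.local slots=1\n")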

Another benefit is that it would probably be the least invasive approach for the actual application.