0
votes

The recommended way of running Horovod in Docker is described at https://github.com/uber/horovod/blob/master/docs/docker.md. The drawback is that it leaves bash as the container's main process and the Python process as a secondary one. docker logs shows bash output, the container's state follows the state of bash, and the container stops when bash exits, so Docker treats bash as the main process when it should be the Python process we start. Is it possible to make the Python process the main process in all worker containers, primary and secondary?

I tried running mpirun outside the containers rather than inside them, passing an interactive docker start command as the program for mpirun to launch (the containers had already been created with nvidia-docker create):

mpirun -H localhost,localhost \
-np 1 \
-bind-to none \
-map-by slot  \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH \
-x NCCL_SOCKET_IFNAME=^docker0,lo \
-mca btl_tcp_if_exclude lo,docker0 \
-mca oob_tcp_if_exclude lo,docker0 \
-mca pml ob1 \
-mca btl ^openib \
docker start -a -i bajaga_aws-ls0-l : \
-np 1 \
-bind-to none \
-map-by slot  \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH \
-x NCCL_SOCKET_IFNAME=^docker0,lo \
-mca btl_tcp_if_exclude lo,docker0 \
-mca oob_tcp_if_exclude lo,docker0 \
-mca pml ob1 \
-mca btl ^openib \
docker start -a -i bajaga_aws-ls1-l

But that failed: the processes did not communicate through Horovod and ran as independent processes.

Do you know how I could make the Python process the container's main process?

1
If you are looking for containers for HPC, Singularity is a much better fit IMHO. If not, start mpirun inside the container (it will fork&exec the local MPI tasks inside the same container) and use the orte_launch_agent param to have the remote orted (and their local MPI tasks) spawned inside a container on a remote host. - Gilles Gouaillardet

1 Answer

0
votes

I managed to get this working well enough with a few tricks:

* Start the container with an entrypoint that runs forever until SIGTERM is received.
* Start the MPI job as a separate process.
* Write its output to process 1's stdout/stderr, so that docker logs works.
* At the end of my process, send SIGTERM to process 1, so that the whole container stops.
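The tricks above could be sketched roughly as follows. This is only an illustration, not the exact scripts I used: the file name pid1_helper.py and the function names wait_for_sigterm and run_and_forward are made up for this example. It assumes a Linux container where /proc/1/fd/1 and /proc/1/fd/2 refer to the entrypoint's stdout and stderr, and where the secondary process is allowed to write to them and to signal process 1.

```python
# pid1_helper.py (hypothetical name): a sketch of the entrypoint and launcher tricks.
import os
import signal
import subprocess
import sys


def wait_for_sigterm():
    """Entrypoint mode: block as process 1 until SIGTERM arrives, then exit."""
    # Process 1 ignores signals that have no handler installed,
    # so a SIGTERM handler has to be registered explicitly.
    def _handle(signum, frame):
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handle)
    while True:
        signal.pause()  # sleep until any signal is delivered


def run_and_forward(cmd, stdout_path="/proc/1/fd/1",
                    stderr_path="/proc/1/fd/2", notify_pid=1):
    """Run cmd with its output sent to process 1's stdout/stderr.

    When cmd finishes, send SIGTERM to notify_pid (process 1 by default)
    so that the whole container stops. Pass notify_pid=None to skip that.
    """
    with open(stdout_path, "wb", buffering=0) as out, \
            open(stderr_path, "wb", buffering=0) as err:
        returncode = subprocess.call(cmd, stdout=out, stderr=err)
    if notify_pid is not None:
        os.kill(notify_pid, signal.SIGTERM)
    return returncode


if __name__ == "__main__":
    if sys.argv[1:] == ["entrypoint"]:
        wait_for_sigterm()
    elif len(sys.argv) > 1:
        # Everything after the script name is the command to run,
        # for example the mpirun command line.
        sys.exit(run_and_forward(sys.argv[1:]))
```

With this layout the container's entrypoint would be "python pid1_helper.py entrypoint", and the MPI job would be started as a second process in the same container (for example through docker exec) as "python pid1_helper.py mpirun ...". Because the job's output ends up on process 1's stdout/stderr, docker logs shows it, and the final SIGTERM makes the container exit when the job is done.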