The recommended way of combining Horovod and Docker is described here: https://github.com/uber/horovod/blob/master/docs/docker.md. That approach is problematic because it leaves bash as the primary Docker process and the python process as a secondary one. docker logs reports bash's output, the container's state is tied to the bash process, the container stops when bash exits, and so on: Docker considers its main process to be bash, while it should be the python process we are starting. Is it possible to make the python process the main process in all Docker workers, primary and secondary?
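For a single container, one common way to get this is an exec-form ENTRYPOINT, so python runs as PID 1 instead of as a child of bash. A minimal sketch (image name and script path are placeholders, not from the Horovod docs):

# Hypothetical Dockerfile fragment: exec-form ENTRYPOINT makes python PID 1,
# so docker logs, the container state and the exit code all follow python.
FROM horovod/horovod:latest
COPY train.py /workspace/train.py
ENTRYPOINT ["python", "/workspace/train.py"]

With the shell form (ENTRYPOINT python /workspace/train.py) Docker would wrap the command in /bin/sh -c and you would be back to a shell as the main process. The harder part of the question is doing the same for the MPI-spawned workers, which is what the attempt below tries to solve.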
I tried starting the mpirun process outside the containers instead of inside one, using an interactive docker start command as the mpirun command (the containers were already prepared with nvidia-docker create):
mpirun -H localhost,localhost \
-np 1 \
-bind-to none \
-map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH \
-x NCCL_SOCKET_IFNAME=^docker0,lo \
-mca btl_tcp_if_exclude lo,docker0 \
-mca oob_tcp_if_exclude lo,docker0 \
-mca pml ob1 \
-mca btl ^openib \
docker start -a -i bajaga_aws-ls0-l : \
-np 1 \
-bind-to none \
-map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH \
-x NCCL_SOCKET_IFNAME=^docker0,lo \
-mca btl_tcp_if_exclude lo,docker0 \
-mca oob_tcp_if_exclude lo,docker0 \
-mca pml ob1 \
-mca btl ^openib \
docker start -a -i bajaga_aws-ls1-l
But that failed: the processes didn't communicate via Horovod and ran as independent processes.
Do you know how I could make the python process the container's main process?
Run mpirun inside the container (it will fork & exec the local MPI tasks inside the same container) and use the orte_launch_agent param to have the remote orted (and their local MPI tasks) spawned inside a container on the remote host. – Gilles Gouaillardet
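A sketch of that suggestion, reusing the container names from the question (the wrapper script name and host name are placeholders; orte_launch_agent is a real Open MPI MCA parameter, but the exact wiring here is an assumption, not a tested setup). On the remote host, install a small wrapper that launches orted inside the already-created container:

#!/bin/sh
# docker-orted (hypothetical path /usr/local/bin/docker-orted):
# run orted inside the container so the remote MPI task, and hence
# the python worker, lives in the container rather than on the host.
exec docker exec -i bajaga_aws-ls1-l orted "$@"

Then start mpirun from inside the primary container, pointing the launch agent at that wrapper:

mpirun -H localhost,remote-host \
       -np 2 \
       -bind-to none -map-by slot \
       --mca orte_launch_agent /usr/local/bin/docker-orted \
       python train.py

This way mpirun forks the local python task inside the primary container, and the remote daemon (and its python task) is spawned inside the secondary container, so python is the working process in both containers instead of an interactive bash session.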