TensorFlow Horovod: NCCL and MPI

Question

Horovod combines NCCL and MPI into a wrapper for distributed deep learning in, for example, TensorFlow. I hadn't heard of NCCL before and was looking into its functionality. The NVIDIA website states the following about NCCL:

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.

From the introductory video about NCCL, I understood that it works over PCIe, NVLink, native InfiniBand and Ethernet, and that it can even detect whether GPUDirect RDMA makes sense in the current hardware topology and use it transparently.

So I am wondering why MPI is needed in Horovod at all. As far as I understand, MPI is also used to exchange gradients efficiently among distributed nodes via an allreduce paradigm. But NCCL appears to support that functionality already.

So is MPI only used to schedule jobs easily on a cluster? Or for distributed deep learning on CPUs, where NCCL cannot be used?

I would highly appreciate it if someone could explain in which scenarios MPI and/or NCCL are used for distributed deep learning and what their responsibilities are during a training job.

My question is aimed more at which NCCL operations are used during training via Horovod and which operations are still needed from MPI, since the two overlap a lot in functionality. — Alex
@Alex: Maybe this presentation can give you some hints about usage of MPI in Horovod. — Krzysztof

eval eval · Accepted Answer · 2020-02-07T07:58:57

Firstly, Horovod used only MPI in the beginning.

After NCCL was introduced to Horovod, MPI is still used, even in NCCL mode, to provide environment information (rank, size and local_rank). The NCCL documentation has an example showing how it leverages MPI in a one-device-per-process setting:

The following code is an example of a communicator creation in the context of MPI, using one device per MPI rank.

https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html#example-2-one-device-per-process-or-thread
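To make "environment information" concrete, here is a minimal sketch of what rank, size and local_rank look like from inside one launched process. It assumes the job was started with Open MPI's `mpirun`, which exports the `OMPI_COMM_WORLD_*` variables read below; other MPI implementations use different names. The helpers `ProcessInfo`, `read_open_mpi_env` and `pick_gpu` are hypothetical names for illustration only. This is not how Horovod is implemented internally; in a Horovod script the same values are exposed through `hvd.rank()`, `hvd.size()` and `hvd.local_rank()` after `hvd.init()`.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ProcessInfo(NamedTuple):
    rank: int        # global index of this process across all nodes
    size: int        # total number of processes in the job
    local_rank: int  # index of this process among those on its own node


def read_open_mpi_env(env: Optional[Mapping[str, str]] = None) -> ProcessInfo:
    """Read the process layout from variables exported by Open MPI's mpirun.

    Assumption: Open MPI is the launcher. Without it, fall back to a
    single-process layout (rank 0 of 1).
    """
    env = os.environ if env is None else env
    return ProcessInfo(
        rank=int(env.get("OMPI_COMM_WORLD_RANK", "0")),
        size=int(env.get("OMPI_COMM_WORLD_SIZE", "1")),
        local_rank=int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")),
    )


def pick_gpu(info: ProcessInfo, gpus_per_node: int) -> int:
    """Map one process to one GPU on its node using local_rank."""
    return info.local_rank % gpus_per_node


if __name__ == "__main__":
    info = read_open_mpi_env()
    gpu = pick_gpu(info, gpus_per_node=4)  # 4 GPUs per node is an assumed value
    print(f"rank {info.rank} of {info.size}, local_rank {info.local_rank}, GPU {gpu}")
```

Saved under a file name of your choice and launched with something like `mpirun -np 4 python <script>`, each process would report a different rank. In the one-device-per-process setting, local_rank is what ties each process to its own GPU before the NCCL communicator is created.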
