I am trying to train a YOLO computer vision model using a container I built that includes an installation of Darknet. The container uses the NVIDIA-supplied base image nvcr.io/nvidia/cuda:9.0-devel-ubuntu16.04.
Using nvidia-docker on my local machine with a GTX 1080 Ti, training runs very fast. However, the same container running as an Azure Container Instance with a P100 GPU trains very slowly, almost as if it is not utilizing the GPU at all. I also noticed that the nvidia-smi command does not work in the container running in Azure, although it does work when I SSH into the container running locally on my machine.
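To show what I mean, these are the quick checks I run from a shell inside the container. This is just a diagnostic sketch; the exact device and library paths are my assumption and may differ between nvidia-docker and Azure:

nvidia-smi                    # should list the GPU if the driver is mounted into the container
ls /dev/nvidia*               # device nodes the GPU runtime is expected to inject
echo $LD_LIBRARY_PATH         # should include the mounted NVIDIA driver libraries
ldconfig -p | grep libcuda    # checks whether the driver-side CUDA library is resolvable

Locally all of these behave as expected; in the Azure container, nvidia-smi does not work.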
Here is the Dockerfile I am using:
FROM nvcr.io/nvidia/cuda:9.0-devel-ubuntu16.04
LABEL maintainer="[email protected]" \
description="Pre-Configured Darknet Machine Learning Environment" \
version=1.0
# Container Dependency Setup
RUN apt-get update && apt-get upgrade -y && apt-get install -y \
    software-properties-common \
    vim \
    dos2unix \
    git \
    wget \
    python3-pip \
    libopencv-dev
# setup virtual environment
WORKDIR /
RUN pip3 install virtualenv
RUN virtualenv venv
WORKDIR /venv
RUN mkdir notebooks data output
# Install Darknet
RUN git clone https://github.com/AlexeyAB/darknet
RUN sed -i 's/GPU=0/GPU=1/g' darknet/Makefile
RUN sed -i 's/OPENCV=0/OPENCV=1/g' darknet/Makefile
WORKDIR /venv/darknet
RUN make
# Install common pip packages
WORKDIR /venv
COPY requirements.txt ./
RUN . /venv/bin/activate && pip install -r requirements.txt
# Setup Environment
EXPOSE 8888
VOLUME ["/venv/notebooks", "/venv/data", "/venv/output"]
CMD . /venv/bin/activate && jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
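One detail I am not sure about: the sed edits above only flip GPU=1 and OPENCV=1, so Darknet is compiled with whatever -gencode flags the Makefile's ARCH variable ships with. My GTX 1080 Ti is compute capability 6.1, while the P100 is 6.0. If that turns out to matter, ARCH can be overridden on the make command line instead of editing the Makefile; the -gencode values below are my assumption for these two cards:

# hypothetical alternative to the plain "RUN make" above
RUN make ARCH="-gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61"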
The requirements.txt file is as shown below:
jupyter
matplotlib
numpy
opencv-python
scipy
pandas
scikit-learn
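For completeness, this is roughly how I start the container in each environment. The image and resource names below are placeholders, and the --gpu-count/--gpu-sku flags reflect my reading of the az CLI's GPU preview for Container Instances, so treat them as an assumption:

# locally, with nvidia-docker2 installed
docker run --runtime=nvidia -p 8888:8888 my-darknet-image

# on Azure, as a GPU-backed container instance (names are placeholders)
az container create --resource-group my-rg --name darknet-train \
  --image myregistry.azurecr.io/my-darknet-image \
  --gpu-count 1 --gpu-sku P100 --ports 8888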