I have some Python code that trains a neural network using TensorFlow.

I've created a Docker image based on the tensorflow/tensorflow:latest-gpu-py3 image that runs my Python script. When I start an EC2 p2.xlarge instance, I can run my Docker container with the command

docker run --runtime=nvidia cnn-userpattern train

and the container runs my code with no errors and uses the host GPU.
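
To confirm that the GPU is really being used, I run a quick check along these lines inside the container (a minimal sketch, assuming a TensorFlow 1.x image like the one above):

# Minimal GPU visibility check for a TensorFlow 1.x image
import tensorflow as tf
from tensorflow.python.client import device_lib

# True only when a CUDA-capable GPU is visible to TensorFlow
print(tf.test.is_gpu_available())

# A working GPU shows up as a device named "/device:GPU:0"
print([d.name for d in device_lib.list_local_devices()])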

The problem is that when I try to run the same container in an AWS SageMaker training job on an ml.p2.xlarge instance (I also tried ml.p3.2xlarge), the algorithm fails with the error:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Now, I know what that error means: the Docker runtime on the host is not set to "nvidia". The AWS documentation says that the command used to run the Docker image is always

docker run image train

which would only work if the default runtime is set to "nvidia" in the Docker daemon.json. Is there any way to edit the host's daemon.json, or to tell Docker in the Dockerfile to use "--runtime=nvidia"?
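
For reference, on a host I control I could set the default runtime with a daemon.json along these lines (the standard nvidia-docker2 configuration; exact paths may differ), but SageMaker does not let me edit that file on the training host:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}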

1 Answer

With some help from the AWS support service, we were able to find the problem. The Docker image I based my code on was, as I said, tensorflow/tensorflow:latest-gpu-py3 (available at https://github.com/aws/sagemaker-tensorflow-container).

the "latest" tag refers to version 1.12.0 at this time. The problem was not my own, but with this version of the docker image.

If I base my Docker image on tensorflow/tensorflow:1.10.1-gpu-py3 instead, it runs as it should and makes full use of the GPU.
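
For completeness, here is a minimal sketch of the Dockerfile that works for me (train.py is just a placeholder name for my own training script):

# Base the image on the 1.10.1 GPU release instead of "latest"
FROM tensorflow/tensorflow:1.10.1-gpu-py3

# Install the training code as an executable named "train",
# because SageMaker starts the container with "docker run image train".
# train.py begins with "#!/usr/bin/env python" so it can run directly.
COPY train.py /usr/local/bin/train
RUN chmod +x /usr/local/bin/train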

Apparently the default runtime is already set to "nvidia" in the Docker daemon.json on all GPU instances of AWS SageMaker, so no --runtime flag is needed.