I have some Python code that trains a neural network using TensorFlow.
I've created a Docker image based on the tensorflow/tensorflow:latest-gpu-py3 image that runs my Python script. When I start an EC2 p2.xlarge instance, I can run my Docker container with the command
docker run --runtime=nvidia cnn-userpattern train
and the container with my code runs with no errors and uses the host GPU.
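As a sanity check, something like the following (a minimal sketch using device_lib, which should work with the TensorFlow version bundled in that image) confirms from inside the container that the GPU is visible:

# check_gpu.py - minimal sketch: print the devices TensorFlow can see
from tensorflow.python.client import device_lib

# With the nvidia runtime active this lists a device of type "GPU"
# alongside the CPU; without it only the CPU shows up (or importing
# tensorflow already fails, as in the SageMaker job below).
for device in device_lib.list_local_devices():
    print(device.device_type, device.name)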
The problem is that when I try to run the same container in an AWS SageMaker training job on an ml.p2.xlarge instance (I also tried ml.p3.2xlarge), the algorithm fails with the error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Now I know what that error means: the Docker host is not using the "nvidia" runtime, so the host's NVIDIA driver libraries (including libcuda.so.1) are never mounted into the container. The AWS documentation says that the command used to run the Docker image is always
docker run image train
which would only work if the default runtime is set to "nvidia" in the host's /etc/docker/daemon.json. Is there any way to edit the host's daemon.json, or to tell Docker in the Dockerfile to use "--runtime=nvidia"?
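For reference, by "default runtime" I mean something like this in the host's /etc/docker/daemon.json (the runtime path may vary depending on how nvidia-container-runtime was installed):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}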