I have to deploy a custom Keras model in AWS SageMaker. I have created a notebook instance and I have the following files:
AmazonSagemaker-Codeset16
-ann
-nginx.conf
-predictor.py
-serve
-train.py
-wsgi.py
-Dockerfile
I then open the AWS terminal, build the Docker image, and push the image to the ECR repository. Next I open a new Jupyter Python notebook and try to fit and deploy the model. The training completes correctly, but while deploying I get the following error:
"Error hosting endpoint sagemaker-example-2019-10-25-06-11-22-366: Failed. >Reason: The primary container for production variant AllTraffic did not pass >the ping health check. Please check CloudWatch logs for this endpoint..."
When I check the logs, I find the following:
2019/11/11 11:53:32 [crit] 19#19: *3 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 10.32.0.4, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "model.aws.local:8080"
and
Traceback (most recent call last):
  File "/usr/local/bin/serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/cli/serve.py", line 19, in main
    server.start(env.ServingEnv().framework_module)
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/_server.py", line 107, in start
    module_app,
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
When I build and deploy the same model with these same files from my local computer, the deployment succeeds; it is only when I do everything from inside AWS that I face this problem.
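For reference, this is roughly how I create and fit the estimator in the notebook (the image URI, IAM role, and S3 path shown here are placeholders, not my real values):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.predictor import csv_serializer

role = sagemaker.get_execution_role()

# Placeholder image URI: substitute the account id, region, and tag of
# the image pushed to ECR.
image = '<account-id>.dkr.ecr.<region>.amazonaws.com/amazonsagemaker-codeset16:latest'

classifier = Estimator(image,
                       role,
                       train_instance_count=1,
                       train_instance_type='ml.m4.xlarge')

# This step completes without errors.
classifier.fit('s3://<bucket>/<training-data-prefix>')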
Here is my serve file code:
from __future__ import print_function

import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))


def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass

    sys.exit(0)


def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/ml/code/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')


# The main routine just invokes the start function.
if __name__ == '__main__':
    start_server()
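gunicorn is pointed at wsgi:app, and my wsgi.py follows the AWS bring-your-own-container sample: it only re-exports the Flask app defined in predictor.py (a minimal sketch):

import predictor as myapp

# gunicorn loads this module as "wsgi:app", so it just needs to expose
# the Flask app object from predictor.py under the name "app".
app = myapp.app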
I deploy the model using the following:
predictor = classifier.deploy(1, 'ml.t2.medium', serializer=csv_serializer)
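Since it is the /ping health check that fails, for completeness here is the shape of that route in my predictor.py (a sketch following the AWS sample; the actual Keras model loading is omitted and replaced with a placeholder):

import flask

app = flask.Flask(__name__)

# Placeholder for the Keras model, which in the real predictor.py is
# loaded at import time.
model = None

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker calls GET /ping through nginx on port 8080; returning
    # 200 marks the container as healthy.
    health = model is not None
    status = 200 if health else 404
    return flask.Response(response='\n', status=status, mimetype='application/json')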
Kindly let me know what mistake I am making while deploying.