
I need to schedule AWS Lambda to open and run a Jupyter notebook I have inside SageMaker, producing a CSV file once a day.

I have already created my notebook instance (let's call it Model_v1) and the lifecycle configuration it needs inside SageMaker. I can start the instance, run the (R) code inside the notebook, and the code writes the CSV file I require.

I have read many posts about using SageMaker with Lambda, but I'm not formally using a "training job", a "model", an endpoint, etc. I literally just want Lambda to 1) start the notebook instance and 2) run the .ipynb code, which generates the CSV.

If there is an easier way to make SageMaker run this script once a day with another tool (instead of Lambda), I'm more than happy to change!


2 Answers

1 vote

https://towardsdatascience.com/automating-aws-sagemaker-notebooks-2dec62bc2c84

This blog post walks through exactly this situation. In short:

  1. Start a Python environment.
  2. Execute the Jupyter notebook.
  3. Download an AWS sample Python script containing auto-stop functionality.
  4. Wait 1 minute (this can be increased or decreased as required).
  5. Create a cron job to execute the auto-stop Python script.
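The Lambda side of this flow only needs to start the notebook instance; the lifecycle configuration's on-start script does the rest (steps 1–5 above). A minimal sketch, assuming the instance name "Model_v1" from the question and a hypothetical event key for overriding it:

```python
def get_instance_name(event, default="Model_v1"):
    """Pure helper: allow the instance name to be overridden via the event."""
    if isinstance(event, dict):
        return event.get("notebook_instance_name", default)
    return default

def lambda_handler(event, context):
    import boto3  # preinstalled in the Lambda runtime
    name = get_instance_name(event)
    # Starting the instance triggers the lifecycle configuration's
    # on-start script, which executes the notebook and schedules auto-stop.
    boto3.client("sagemaker").start_notebook_instance(NotebookInstanceName=name)
    return {"started": name}
```

To get the once-a-day schedule, trigger this function from a CloudWatch Events / EventBridge rule with a cron or rate expression such as rate(1 day).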
0 votes

You can run a notebook programmatically with papermill. The papermill-lambda project shows how to bring the papermill dependency into Lambda, though I have never tried it myself. A cleaner setup is to encapsulate the model science in a Docker container, as is done in this SageMaker R tutorial. Then you can have a Lambda function launch a training job from the Lambda-compatible SDK of your choice (for example the boto3 create_training_job call; boto3 is installed by default in Lambda).
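The papermill approach boils down to one call. A hedged sketch, assuming papermill and an R kernel are installed; the notebook paths and the dated-output helper are illustrative, not names from the question:

```python
def dated_output_path(input_path, day):
    """Pure helper: derive a per-day output notebook name, e.g. for daily runs."""
    stem = input_path.rsplit(".", 1)[0]
    return f"{stem}-{day}.ipynb"

def run_notebook(input_path, output_path, parameters=None):
    import papermill as pm  # lazy import: only needed when actually executing
    # Executes every cell and writes an executed copy (with outputs) to output_path.
    pm.execute_notebook(input_path, output_path, parameters=parameters or {})
    return output_path
```

The executed copy doubles as a run log: if a cell fails, the error is captured in the output notebook.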

Note that writing the model in a SageMaker-compatible Docker container lets you benefit from the full SageMaker experience in the language of your choice (here R), including but not limited to:

  • Training job orchestration over various types of hardware and network configurations, with multiple SDKs (Python, CLI, JavaScript, PHP, Go, Ruby, Java)
  • Bayesian hyperparameter search
  • Native logging of hardware usage and algorithm output, optional metric dashboard with regular expressions
  • 1-click deployment to managed real-time endpoint, optionally multi-availability zone and auto-scaled
  • Native metadata persistence (hyperparameters, data path, artifact, training configuration and duration, among others) and search.
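Launching such a training job from Lambda can be sketched as follows. The image URI, role ARN, and S3 paths are placeholders you must replace with your own; the instance type and limits are illustrative defaults:

```python
import time

def build_training_job_request(job_prefix, image_uri, role_arn, output_s3):
    """Pure helper: assemble a minimal create_training_job request body."""
    return {
        # Job names must be unique, hence the timestamp suffix.
        "TrainingJobName": f"{job_prefix}-{int(time.time())}",
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

def lambda_handler(event, context):
    import boto3  # preinstalled in Lambda
    request = build_training_job_request(
        job_prefix="model-v1",
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/r-model:latest",
        role_arn="arn:aws:iam::<account>:role/SageMakerExecutionRole",
        output_s3="s3://<bucket>/output/",
    )
    boto3.client("sagemaker").create_training_job(**request)
    return request["TrainingJobName"]
```

Scheduled daily via a CloudWatch Events / EventBridge rule, this gives the same once-a-day CSV without keeping a notebook instance running.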