3 votes

I can start and stop SageMaker notebook instances with boto3, but how do I run the Jupyter notebooks or .py scripts inside them?

This is something I'll run from a local environment or from Lambda (but that's not a problem).

Start a SageMaker notebook instance:

import boto3

client = boto3.client('sagemaker')

client.start_notebook_instance(
    NotebookInstanceName='sagemaker-notebook-name'
)

docs
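
As an aside, boto3 also exposes waiters for notebook instances, so (assuming I'm reading the docs right) I can block after the start call until the instance is actually running:

# Wait until the instance reaches the InService state before using it.
waiter = client.get_waiter('notebook_instance_in_service')
waiter.wait(NotebookInstanceName='sagemaker-notebook-name')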

In the UI I would just click "Open Jupyter", then run a notebook or a .py script inside it.

But I want to do it programmatically with boto3 or other.

My file inside is called lemmatize-input-data.ipynb.

This must be possible, but I'm unsure how.

I also tried the following in a "start notebook" lifecycle configuration script, after creating a simpler test file called test_script.ipynb to be certain it wasn't something advanced in my original notebook that caused the error:

set -e

jupyter nbconvert --execute test_script.ipynb

But I got the error:

[NbConvertApp] WARNING | pattern 'test_script.ipynb' matched no files

Did any of the suggestions work for you? I am also facing the same issue, but it's not currently working for me. - Saurabh_Jhingan

3 Answers

2 votes

I encourage you to look into papermill. It copies and runs a template notebook, using nbconvert under the hood. The main benefit I have found with papermill is that you can easily parameterize notebooks and pass the parameters in via a Python dictionary. The executed copies of the template then maintain a history of what was run and of the results.

Your code would be something like:

import papermill as pm

pm.execute_notebook(
   'lemmatize-input-data.ipynb',
   'lemmatize-input-data-####.ipynb'
)

With #### being something like datetime.now(), or whatever you would like to use to differentiate the notebooks as they execute.
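
For example, a timestamped, parameterized run might look like the following sketch (the parameters dict is hypothetical and must match a cell tagged "parameters" in the template notebook):

from datetime import datetime

import papermill as pm

# Timestamp used to differentiate each executed copy of the template.
stamp = datetime.now().strftime('%Y%m%d-%H%M%S')

pm.execute_notebook(
    'lemmatize-input-data.ipynb',
    f'lemmatize-input-data-{stamp}.ipynb',
    # Hypothetical parameters; papermill injects them into the copy.
    parameters={'input_path': 's3://my-bucket/input.csv'}
)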

Since notebooks are intended to be living documents, you want to limit the number of external dependencies that would introduce breaking changes if the notebook changes and you need to re-run it as of a point in time. Papermill addresses this by taking a snapshot of what was executed at that time.

Update for a little more background:

I would move the Python code into the Jupyter notebook rather than keeping it in a separate script. The notebook executes cell by cell and acts just like a script, and it also lets you print and display intermediate and final values within the notebook if needed. When papermill copies and executes the template notebook, all of the output is displayed and saved within the executed copy, which is handy for any graphs that are generated.

Papermill also has functionality to aggregate data across notebooks. See here for a good article summarizing papermill in general. Papermill was designed at Netflix, and they have a good post about the philosophy behind it here, in which they reference machine learning.

All this being said, papermill can be used to easily document each step of training your machine learning model in SageMaker. Then, using the aggregation capabilities of papermill, you can see graphically how your model has changed over time.

1 vote

You have the correct approach of executing the notebook inside a lifecycle configuration script. The issue is that the working directory of the script is "/", whereas the Jupyter server starts up from /home/ec2-user/SageMaker.

So, if you modify the script to use the absolute path to the notebook file, it should work:

jupyter nbconvert --execute /home/ec2-user/SageMaker/lemmatize-input-data.ipynb
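
If you want to set the lifecycle configuration up programmatically as well, it can be created with boto3; here is a minimal sketch (the configuration name is just an example, and note that the script content must be base64-encoded):

import base64

import boto3

client = boto3.client('sagemaker')

# The on-start script, with the absolute path fix from above.
on_start = """#!/bin/bash
set -e
jupyter nbconvert --execute /home/ec2-user/SageMaker/lemmatize-input-data.ipynb
"""

client.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='run-lemmatize-on-start',  # example name
    OnStart=[{'Content': base64.b64encode(on_start.encode()).decode()}]
)

You can then attach it to the instance via the LifecycleConfigName parameter of create_notebook_instance or update_notebook_instance.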

Thanks for using Amazon SageMaker!

0 votes

Have a look here. This can deploy your Jupyter notebook as a serverless function, which you can then invoke via its REST endpoint.