
I want to avoid using a SageMaker notebook and preprocess data before training, e.g. by simply converting from CSV to protobuf format for the built-in models, as shown in the first link below:

https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html
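For reference, here is a minimal sketch of that kind of format-only conversion (CSV to RecordIO-protobuf) with the SageMaker Python SDK; the file name, column layout (label in the first column), and S3 bucket/prefix are placeholders, not values from the linked doc:

```python
import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Load the CSV: assume the label is the first column and the features follow.
data = np.loadtxt("train.csv", delimiter=",", dtype="float32")
labels = data[:, 0]
features = data[:, 1:]

# Serialize the arrays to RecordIO-protobuf in memory.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# Upload the protobuf file to S3 so a training job can read it.
boto3.resource("s3").Bucket("my-bucket").Object(
    "sagemaker/train/train.protobuf"
).upload_fileobj(buf)
```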

The following example explains preprocessing with scikit-learn pipelines using the SageMaker Python SDK:

https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/

What are the best practices if you only need to make format-like changes and don't need the scikit-learn style of processing?


1 Answer


It's not necessary to use SageMaker notebook instances to perform pre-processing or training. Notebooks are a way to explore and carry out experiments. For production use cases, you can orchestrate the tasks of an ML pipeline, such as pre-processing, data preparation (feature engineering, format conversion, etc.), model training, and evaluation, using AWS Step Functions. Julien has covered this in his recent talk here.
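As a rough sketch of what that orchestration can look like with the AWS Step Functions Data Science SDK (the `stepfunctions` package): the Glue job name, estimator image, roles, and S3 paths below are placeholders, and the chain simply runs a pre-processing job followed by a training job.

```python
import sagemaker
from stepfunctions.steps import Chain, GlueStartJobRunStep, TrainingStep
from stepfunctions.workflow import Workflow

# Step 1: run the pre-processing Glue job (format conversion, feature engineering).
preprocess_step = GlueStartJobRunStep(
    "Preprocess",
    parameters={"JobName": "csv-to-protobuf"},  # assumed pre-existing Glue job
)

# Step 2: launch a SageMaker training job on the converted data.
estimator = sagemaker.estimator.Estimator(
    image_uri="<algorithm-image-uri>",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
)
train_step = TrainingStep(
    "Train",
    estimator=estimator,
    data={"train": "s3://my-bucket/sagemaker/train/"},
    job_name="my-training-job",
)

# Chain the steps into a workflow and create the state machine.
workflow = Workflow(
    name="preprocess-and-train",
    definition=Chain([preprocess_step, train_step]),
    role="<step-functions-execution-role-arn>",
)
workflow.create()
workflow.execute()
```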

You can also explore using AWS Glue for pre-processing, either with a Python script (via a Python Shell job) or with Apache Spark (a Glue Spark job). Refer to this blog for such a use case: https://aws.amazon.com/blogs/machine-learning/ensure-consistency-in-data-processing-code-between-training-and-inference-in-amazon-sagemaker/
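A minimal sketch of setting up such a Glue Python Shell job with boto3, assuming your conversion script has already been uploaded to S3; the job name, role, script location, and arguments are placeholders, not values from the blog post:

```python
import boto3

glue = boto3.client("glue")

# Register a Python Shell job that runs the format-conversion script.
glue.create_job(
    Name="csv-to-protobuf",
    Role="<glue-service-role-arn>",
    Command={
        "Name": "pythonshell",  # lightweight Python job, no Spark cluster needed
        "ScriptLocation": "s3://my-bucket/scripts/preprocess.py",
        "PythonVersion": "3",
    },
    GlueVersion="1.0",
    MaxCapacity=0.0625,  # smallest Python Shell capacity
)

# Kick off a run; the script can read arguments such as input/output S3 prefixes.
glue.start_job_run(
    JobName="csv-to-protobuf",
    Arguments={
        "--input_prefix": "s3://my-bucket/raw/",
        "--output_prefix": "s3://my-bucket/sagemaker/train/",
    },
)
```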