
I'm building a scikit-learn model on SageMaker.

I'd like to reference the data used in training inside my predict_fn. (Instead of the indices returned by nearest-neighbor search (NNS), I'd like to return the name and data of each neighbor.)

I know this can be done by writing to and reading from S3, as in https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/ , but I was wondering whether there is a more elegant solution.

Are there other ways to make the data used in the training job available to the prediction function?

Edit: Using the advice from the accepted answer, I was able to pass the training data alongside the model as a dict:

import joblib
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors()
model = nn.fit(train_data)

# Bundle the fitted model and its training data into a single artifact
model_dict = {
    "model": model,
    "reference": train_data,
}

joblib.dump(model_dict, path)

Then in predict_fn:

def predict_fn(input_data, model_dict):
    model = model_dict["model"]
    reference = model_dict["reference"]
    # Map neighbor indices back to training rows (assumes reference is a DataFrame)
    _, indices = model.kneighbors(input_data)
    return reference.iloc[indices[0]]
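
For completeness, the dict is loaded back in model_fn; a minimal sketch, assuming the artifact was dumped to a file named model.joblib inside model_dir (the filename is illustrative):

import os

import joblib

def model_fn(model_dir):
    # Load the dict produced by joblib.dump above; SageMaker passes the
    # directory where model.tar.gz was extracted.
    return joblib.load(os.path.join(model_dir, "model.joblib"))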

1 Answer


You can bring a file to the endpoint instance (either inside the model.tar.gz or via a later download) that stores the mapping between indices and record names; this way you can translate neighbor IDs to record names on the fly in predict_fn or output_fn. For giant indexes, this mapping (along with other metadata) can live in an external database instead (e.g. DynamoDB, Redis).
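
A minimal sketch of that idea, assuming a names.json file shipped inside model.tar.gz that maps row indices to record names (the filename and layout are illustrative):

import json
import os

import joblib

def model_fn(model_dir):
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    # Hypothetical mapping file shipped alongside the model artifact
    with open(os.path.join(model_dir, "names.json")) as f:
        index_to_name = json.load(f)
    return {"model": model, "names": index_to_name}

def predict_fn(input_data, bundle):
    _, indices = bundle["model"].kneighbors(input_data)
    # Translate neighbor indices to record names on the fly
    return [[bundle["names"][str(i)] for i in row] for row in indices]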

The link you attached (SageMaker Batch Transform) is quite a different concept: it instantiates an ephemeral fleet of machines to run a one-time prediction task, with input data read from S3 and results written back to S3. Your question seems to refer to the alternative, a permanent real-time endpoint deployment.