
I am trying to create a multi-model endpoint in AWS SageMaker using scikit-learn and a custom training script. When I attempt to train my model using the following code:

estimator = SKLearn(
    entry_point=TRAINING_FILE, # script to use for training job
    role=role,
    source_dir=SOURCE_DIR, # Location of scripts
    train_instance_count=1,
    train_instance_type=TRAIN_INSTANCE_TYPE,
    framework_version='0.23-1',
    output_path=s3_output_path,# Where to store model artifacts
    base_job_name=_job,
    code_location=code_location,# This is where the .tar.gz of the source_dir will be stored
    hyperparameters={'max_samples': 100,  # key must match the --max_samples argument in the training script
                     'model_name': key})

DISTRIBUTION_MODE = 'FullyReplicated'

train_input = sagemaker.s3_input(s3_data=inputs+'/train', 
                                  distribution=DISTRIBUTION_MODE, content_type='csv')
    
estimator.fit({'train': train_input}, wait=True)

where 'TRAINING_FILE' contains:


import argparse
import os

import numpy as np
import pandas as pd
import joblib
import sys

from sklearn.ensemble import IsolationForest

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--max_samples', type=int, default=100)
    
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_name', type=str)

    args, _ = parser.parse_known_args()

    print('reading data. . .')
    print('model_name: '+args.model_name)    
    
    train_file = os.path.join(args.train, args.model_name + '_train.csv')    
    train_df = pd.read_csv(train_file) # read in the training data
    train_tgt = train_df.iloc[:, 1] # target column is the second column
    
    clf = IsolationForest(max_samples = args.max_samples)
    clf = clf.fit(train_tgt.to_frame())  # 2-D input: one row per sample, one feature column
    
    path = os.path.join(args.model_dir, 'model.joblib')
    joblib.dump(clf, path)
    print('model persisted at ' + path)

The training script succeeds, but SageMaker throws an UnexpectedStatusException.

Has anybody experienced anything like this before? I've checked all the CloudWatch logs and found nothing useful, and I'm stumped on what to try next.

Not a useful error message by AWS! - Abdul Jabbar Azam

1 Answer


For anyone who comes across this issue in the future: the problem has been solved.

The issue had nothing to do with the training itself; it was caused by invalid characters in the directory names being sent to S3. The script produced the artifacts correctly, but SageMaker threw an exception when it tried to save them to S3.
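
I didn't note exactly which characters were at fault, so the following is only a rough sketch of a pre-flight check, not the fix I actually applied. It assumes the safe-character list from the S3 object key naming guidelines (alphanumerics plus `! - _ . * ' ( )`, with `/` as the delimiter) is a reasonable filter. The helper name `unsafe_s3_chars` is hypothetical. It could be run on values such as `s3_output_path` and `code_location` from the question before calling `estimator.fit`.

```python
import re

# Assumption: anything outside the "safe characters" listed in the S3 object
# key naming guidelines is suspect. "/" is allowed as the prefix delimiter.
_UNSAFE = re.compile(r"[^0-9A-Za-z!\-_.*'()/]")


def unsafe_s3_chars(s3_uri):
    """Return the sorted, de-duplicated characters in the key part of an
    s3:// URI that fall outside the assumed safe set (hypothetical helper)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI, got: " + s3_uri)
    # Drop the scheme, then split the bucket name from the key prefix.
    _bucket, _sep, key = s3_uri[len("s3://"):].partition("/")
    return sorted(set(_UNSAFE.findall(key)))
```

An empty list means the prefix only uses characters from the assumed safe set; anything else is worth renaming before launching the training job.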