
I am working through the AWS SageMaker notebook example "Inference Pipeline with Scikit-learn and Linear Learner", and I run into an issue when it comes to fitting the SKLearn model.

The code in the example is:

from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_abalone_featurizer.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)

sklearn_preprocessor.fit({'train': train_input})

When I run this, I get an error:

ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied

So I changed the sklearn_preprocessor to:

sklearn_preprocessor = SKLearn(
    output_path='s3://{}/{}/model'.format(s3_bucket, prefix),
    entry_point=script_path,
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)

Here s3_bucket is the name of my existing bucket and prefix is the path (key prefix) inside it.
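For reference, this is roughly how those names are set up in my notebook, following the sample; the bucket name, prefix and file name below are placeholders rather than my real values:

import sagemaker
from sagemaker import get_execution_role

# Placeholder values; my actual bucket and prefix differ.
s3_bucket = 'my-existing-bucket'
prefix = 'sagemaker/sklearn-abalone'

sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Training data uploaded to my own bucket, as in the sample notebook.
train_input = sagemaker_session.upload_data(
    path='abalone_train.csv',
    bucket=s3_bucket,
    key_prefix='{}/train'.format(prefix))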

Still, SKLearn wants to create a bucket even though mine already exists. When I fit one of AWS's built-in models using the same output_path, it works fine. Is there a way to solve this without changing the authorization policy?

EDIT: I edited the role of my notebook instance and the training job was able to run, but it still created a bucket (INFO:sagemaker:Created S3 bucket: sagemaker-eu-west-1-************) in which it saved the model artifact. How can I force it to save the artifact to a given bucket?


1 Answer


The estimator should only create the default bucket when output_path is not specified:

https://github.com/aws/sagemaker-python-sdk/blob/ab1f7587bf1c35a54549cc676c273dea356301e4/src/sagemaker/estimator.py#L199
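The gist of that line, paraphrased from memory rather than quoted verbatim, is that the session's default bucket (the one that gets auto-created) only comes into play when no output_path was passed:

# Rough paraphrase of the linked logic, not the exact SDK source:
# default_bucket() creates/returns the "sagemaker-<region>-<account>" bucket,
# and it is only called when output_path was left unset.
if self.output_path is None:
    self.output_path = 's3://{}/'.format(self.sagemaker_session.default_bucket())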

I am not able to reproduce this. I started a hosted notebook instance on AWS SageMaker, copied over the sample notebook, and made the same modification:

from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_abalone_featurizer.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    output_path='s3://<my_bucket>/',
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)

The training job runs and finishes without creating any additional bucket. I was able to find the trained model in my existing bucket.
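If you want to double-check where the artifact from a given run ended up, one way (assuming fit has already completed, so latest_training_job is populated) is to describe the training job through the session's SageMaker client:

# Where did the last training job write its model artifact?
job_name = sklearn_preprocessor.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(
    TrainingJobName=job_name)
print(desc['ModelArtifacts']['S3ModelArtifacts'])  # should point into your own bucket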

Sometimes it's hard to track which code is actually running in a Jupyter notebook. Did you rerun the cell that creates the SKLearn object after modifying it?