
In SageMaker, I was able to load and deploy a model from S3. While deserializing the prediction response, I am getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 2: invalid continuation byte" on the line "results = predictor.predict(test_X)".

I followed this SageMaker example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/linear_time_series_forecast/linear_time_series_forecast.ipynb . I was able to train, validate, and deploy the model, and store the model artifact in S3.

After this I wanted to import the model from S3 into SageMaker and test against the imported model. I was able to load and deploy the model, but when predicting on the test values I get the UnicodeDecodeError:

import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer, json_deserializer

role = get_execution_role()
sagemaker_session = sagemaker.Session()
# point at the model artifact stored in S3
model_data = sagemaker.session.s3_input(model_file_location_in_s3,
                                        distribution='FullyReplicated',
                                        content_type='application/x-sagemaker-model',
                                        s3_data_type='S3Prefix')
sagemaker_model = sagemaker.LinearLearnerModel(model_data=model_data,
                                               role=role,
                                               sagemaker_session=sagemaker_session)
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.t2.medium')

# load test data and build the lag/trend/seasonality features as in the example
gas = pd.read_csv('gasoline.csv', header=None, names=['thousands_barrels'], encoding='utf-8')
gas['thousands_barrels_lag1'] = gas['thousands_barrels'].shift(1)
gas['thousands_barrels_lag2'] = gas['thousands_barrels'].shift(2)
gas['thousands_barrels_lag3'] = gas['thousands_barrels'].shift(3)
gas['thousands_barrels_lag4'] = gas['thousands_barrels'].shift(4)
gas['trend'] = np.arange(len(gas))
gas['log_trend'] = np.log1p(np.arange(len(gas)))
gas['sq_trend'] = np.arange(len(gas)) ** 2
weeks = pd.get_dummies(np.array(list(range(52)) * 15)[:len(gas)], prefix='week')
gas = pd.concat([gas, weeks], axis=1)
gas = gas.iloc[4:, ]  # drop the first rows, whose lag features are NaN
split_train = int(len(gas) * 0.6)
split_test = int(len(gas) * 0.3)
test_y = gas['thousands_barrels'][split_test:]
test_X = gas.drop('thousands_barrels', axis=1).iloc[split_test:, ].values  # .as_matrix() is deprecated

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = json_deserializer

results = predictor.predict(test_X)
one_step = np.array([r['score'] for r in results['predictions']])

The program works fine when the model is trained and deployed in the same session (as in the example), but when the model is loaded from S3 it throws this error.

The test data is a NumPy ndarray.
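
For reference, the error message itself is just what happens when binary bytes are decoded as UTF-8; a minimal sketch with made-up bytes (not the actual response body) reproduces it:

# hypothetical bytes: 0xd7 opens a two-byte UTF-8 sequence, but 0x01 is not a valid continuation byte
b'\x50\x4b\xd7\x01'.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 2: invalid continuation byte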

I have the same issue, did you manage to fix this? - clog14

1 Answer


The deserializer does not seem to be appropriate for the content type of the response.

To investigate, write a custom deserializer that just prints some details:

def debug_deserializer(data, content_type):
    print(content_type)
    print(data)
    # returns None, so predict() will return None as well

and apply it like:

predictor.deserializer = debug_deserializer
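
Then call the endpoint again so the details are printed (same test_X as in the question):

results = predictor.predict(test_X)  # debug_deserializer prints the details, then returns None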

This could, for example, yield something like this:

application/x-recordio-protobuf
<botocore.response.StreamingBody object at 0x7fd3544883c8>
None

This tells you the content type is application/x-recordio-protobuf. Then write a custom deserializer, for example:

from sagemaker.amazon.common import RecordDeserializer

def recordio_protobuf_deserialize(data, content_type):
    # parse the recordio-protobuf response body into a list of Record objects
    rec_des = RecordDeserializer()
    return rec_des.deserialize(data, content_type)

and apply it like:

predictor.deserializer = recordio_protobuf_deserialize
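
With that in place, predict returns the parsed records instead of raising. A hedged sketch of extracting the predictions, assuming the linear learner response stores each score in the record's label map under 'score' (the usual layout for the built-in algorithms' recordio-protobuf responses):

import numpy as np

results = predictor.predict(test_X)
# each Record carries its prediction as a float32 tensor in label['score']
one_step = np.array([r.label['score'].float32_tensor.values[0] for r in results])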