
In SageMaker, I was able to load and deploy a model from S3. While deserializing the prediction response, I am getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 2: invalid continuation byte" on the line "results = predictor.predict(test_X)".

I followed this SageMaker example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/linear_time_series_forecast/linear_time_series_forecast.ipynb . I was able to train, validate, and deploy the model, and store the model artifact in S3.

After this I wanted to import the model from S3 into SageMaker and test against the imported model. I was able to load and deploy the model, but when predicting on the test values I get the UnicodeDecodeError:

import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer, json_deserializer

role = get_execution_role()
sagemaker_session = sagemaker.Session()
# point at the model artifact stored in S3
model_data = sagemaker.session.s3_input(model_file_location_in_s3,
                                        distribution='FullyReplicated',
                                        content_type='application/x-sagemaker-model',
                                        s3_data_type='S3Prefix')
sagemaker_model = sagemaker.LinearLearnerModel(model_data=model_data,
                                               role=role,
                                               sagemaker_session=sagemaker_session)
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.t2.medium')

# load test data and build the lag/trend/seasonality features as in the example
gas = pd.read_csv('gasoline.csv', header=None, names=['thousands_barrels'], encoding='utf-8')
gas['thousands_barrels_lag1'] = gas['thousands_barrels'].shift(1)
gas['thousands_barrels_lag2'] = gas['thousands_barrels'].shift(2)
gas['thousands_barrels_lag3'] = gas['thousands_barrels'].shift(3)
gas['thousands_barrels_lag4'] = gas['thousands_barrels'].shift(4)
gas['trend'] = np.arange(len(gas))
gas['log_trend'] = np.log1p(np.arange(len(gas)))
gas['sq_trend'] = np.arange(len(gas)) ** 2
weeks = pd.get_dummies(np.array(list(range(52)) * 15)[:len(gas)], prefix='week')
gas = pd.concat([gas, weeks], axis=1)
gas = gas.iloc[4:, ]  # drop the first rows, whose lag features are NaN
split_train = int(len(gas) * 0.6)
split_test = int(len(gas) * 0.3)
test_y = gas['thousands_barrels'][split_test:]
test_X = gas.drop('thousands_barrels', axis=1).iloc[split_test:, ].values  # .as_matrix() is deprecated

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = json_deserializer

results = predictor.predict(test_X)
one_step = np.array([r['score'] for r in results['predictions']])

The program works fine when the model is trained and deployed in the same session (as in the example), but when the model is loaded from S3 it throws this error.

The test data is a NumPy ndarray.
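
For reference, the error message itself is just what happens when binary bytes are decoded as UTF-8; a minimal sketch with made-up bytes (not the actual response body) reproduces it:

# hypothetical bytes: 0xd7 opens a two-byte UTF-8 sequence, but 0x01 is not a valid continuation byte
b'\x50\x4b\xd7\x01'.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 2: invalid continuation byte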

I have the same issue, did you manage to fix this? - clog14

1 Answer


The deserializer does not seem to be appropriate for the content type of the response.

To investigate, write a custom deserializer that just prints some details:

def debug_deserializer(data, content_type):
    print(content_type)
    print(data)
    # returns None, so predict() will return None as well

and apply it like:

predictor.deserializer = debug_deserializer
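
Then call the endpoint again so the details are printed (same test_X as in the question):

results = predictor.predict(test_X)  # debug_deserializer prints the details, then returns None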

This could, for example, yield something like this:

application/x-recordio-protobuf
<botocore.response.StreamingBody object at 0x7fd3544883c8>
None

This tells you the content type is application/x-recordio-protobuf. Then write a custom deserializer, for example:

from sagemaker.amazon.common import RecordDeserializer

def recordio_protobuf_deserialize(data, content_type):
    # parse the recordio-protobuf response body into a list of Record objects
    rec_des = RecordDeserializer()
    return rec_des.deserialize(data, content_type)

and apply it like:

predictor.deserializer = recordio_protobuf_deserialize
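
With that in place, predict returns the parsed records instead of raising. A hedged sketch of extracting the predictions, assuming the linear learner response stores each score in the record's label map under 'score' (the usual layout for the built-in algorithms' recordio-protobuf responses):

import numpy as np

results = predictor.predict(test_X)
# each Record carries its prediction as a float32 tensor in label['score']
one_step = np.array([r.label['score'].float32_tensor.values[0] for r in results])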