
I am running a Python script with TensorFlow in an Amazon SageMaker notebook instance. I can write to the notebook's storage without any trouble, but for some reason the TensorFlow model checkpoints fail to save. This code worked before it was ported to SageMaker.

Below is a reduced version of my code:

import os
from time import time

bucket = 'sagemaker-complaints-data'
prefix = 'DeepTestV2'  # place to upload training files within the bucket

# Build a timestamped output directory for this run
timestamp = str(int(time()))
out_dir = os.path.abspath(os.path.join(bucket, prefix, "runs", timestamp))
checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
checkpoint_prefix = os.path.join(checkpoint_dir, "model")

# saver is a tf.train.Saver and sess the active session (setup omitted here)
path = saver.save(sess, checkpoint_prefix, global_step=current_step)
print("Saved model checkpoint to {}\n".format(path))

No errors are thrown, and the print statement outputs the expected path. I have searched for known issues with checkpoints in SageMaker but have found no posts describing this behavior.
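As an aside, note that `os.path.abspath(os.path.join(bucket, ...))` produces a local filesystem path on the notebook instance, not an S3 URI, so the checkpoint is written to local disk. A quick check, using the same `bucket` and `prefix` values as above and a placeholder timestamp:

```python
import os

bucket = 'sagemaker-complaints-data'
prefix = 'DeepTestV2'
timestamp = '1234567890'  # placeholder standing in for str(int(time()))

out_dir = os.path.abspath(os.path.join(bucket, prefix, "runs", timestamp))
print(out_dir)

# The result is rooted in the current working directory, not in S3
assert not out_dir.startswith("s3://")
```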

It might be an issue with the IAM role permissions granted to the instance or the training job. Does the role have permission to write to that S3 bucket? You can also check the CloudWatch Logs for hints about possible errors. - Guy

1 Answer


I have found the cause: for some reason "checkpoints" appears to be a reserved word in this context. Renaming the directory to "checks" allowed the folder to be written. Hope this helps someone!
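In terms of the code from the question, the fix amounts to a one-word change when building the path. A minimal sketch, with a temporary directory standing in for the real run directory:

```python
import os
import tempfile

# Temporary directory standing in for the real out_dir from the question
out_dir = tempfile.mkdtemp()

# Original, failing layout used os.path.join(out_dir, "checkpoints");
# renaming the directory to "checks" is what worked
checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checks"))
checkpoint_prefix = os.path.join(checkpoint_dir, "model")

# Create the directory up front rather than relying on saver.save() to do it
os.makedirs(checkpoint_dir, exist_ok=True)
print(checkpoint_prefix)
```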