AWS Sagemaker failure after successful training “ClientError: Artifact upload failed:Insufficient disk space”

Question

I'm training a network using a custom Docker image. The first training run, with 50,000 steps, completed without problems. When I increased the number of steps to 80,000, I got the error "ClientError: Artifact upload failed:Insufficient disk space". The step count is the only thing I changed, so this seems strange to me. There are no errors in the CloudWatch log; my last entry is:

Successfully generated graphs: ['pipeline.config', 'tflite_graph.pb', 'frozen_inference_graph.pb', 'tflite_graph.pbtxt', 'tflite_quant_graph.tflite', 'saved_model', 'hyperparameters.json', 'label_map.pbtxt', 'model.ckpt.data-00000-of-00001', 'model.ckpt.meta', 'model.ckpt.index', 'checkpoint']

This just means that those files were created, because the log line comes from a simple directory listing:

    graph_files = os.listdir(model_path + '/graph')

Which disk space is the error talking about? Looking at the training job, the disk utilization chart shows a rising curve that peaks at 80%. I expected that, once the files listed above had been created, everything would be uploaded to my S3 bucket, where there are no disk space issues. Why does training work with 50,000 steps and fail with 80,000? My understanding is that the number of training steps doesn't affect the size of the model files.
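One way to look into the "which disk space?" question from inside the container is to compare the size of the output directory with the free space on the volume that holds it. The following is a minimal sketch, not a confirmed diagnosis: `dir_size_bytes` and `disk_report` are hypothetical helper names, and it assumes the artifacts are written under `/opt/ml/model`, the conventional model directory for custom training containers. If the artifacts are packed into an archive on the instance before being uploaded (my understanding, not something the logs above confirm), the volume needs room for that archive on top of the files themselves, and a longer run that leaves extra checkpoints or event files behind would need more.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def disk_report(path):
    """Compare the size of `path` with the free space on the volume holding it."""
    used = dir_size_bytes(path)
    free = shutil.disk_usage(path).free
    return {
        "dir_bytes": used,
        "free_bytes": free,
        # Assumption: a second copy of the directory is roughly the worst case
        # for packing it into an archive before upload.
        "room_for_copy": free >= used,
    }
```

Calling `disk_report("/opt/ml/model")` at the very end of the training script and logging the result would show whether the 80,000-step run leaves noticeably more data, or noticeably less free space, than the 50,000-step run.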

rok rok · Accepted Answer · 2020-05-07T10:10:23

Increasing the volume size of the training job, by setting "Additional storage volume per instance (GB)" to 5 GB when creating the job, seems to solve the problem. I still don't understand why, but the problem appears to be solved.
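For jobs created from code rather than the console, the same setting should correspond to the `VolumeSizeInGB` field of `ResourceConfig` in the `CreateTrainingJob` API. Here is a minimal sketch that only builds that part of the request; `build_resource_config` is a hypothetical helper, and the instance type in the usage note is a placeholder, not the one used in the question.

```python
def build_resource_config(instance_type, volume_size_gb, instance_count=1):
    """Build the ResourceConfig section of a CreateTrainingJob request.

    `volume_size_gb` is the size of the storage volume attached to each
    training instance, in gigabytes.
    """
    if volume_size_gb < 1:
        raise ValueError("volume_size_gb must be at least 1")
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": volume_size_gb,
    }
```

The result, for example `build_resource_config("ml.p3.2xlarge", 5)`, would be passed as the `ResourceConfig` argument of boto3's `create_training_job`, alongside the other required arguments. With the SageMaker Python SDK, the equivalent knob is the `volume_size` argument of `Estimator`.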
