I am following the mnist-2 guide from the AWS GitHub documentation to implement my own training job: https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving. I have written my code using a similar structure, but I would like to visualise the training and validation metrics in CloudWatch while the job is running. Do I need to manually specify the metrics I am trying to observe? The AWS guide states "SageMaker automatically parses the logs for metrics that built-in algorithms emit and sends those metrics to CloudWatch." I am only using TensorFlow's training and validation accuracy and loss metrics, and I am not sure whether these count as built-in or whether I need to define them manually.
1 Answer
If you are not using a built-in algorithm, as in the example you linked, you have to define your metrics when you create the training job. You define regular expressions that grab the metric values from the logs, and CloudWatch then plots them for you. The x axis is always the timestamp; you cannot change it. Basically, run your training job, observe how the metrics are printed in the logs, and then build the appropriate regex. For example, I am using COCO metrics in TensorFlow, which periodically produce this:
INFO:tensorflow:Saving dict for global step 1109: DetectionBoxes_Precision/mAP = 0.111895345, DetectionBoxes_Precision/mAP (large) = 0.12102994, DetectionBoxes_Precision/mAP (medium) = 0.050807837, DetectionBoxes_Precision/mAP (small) = -1.0, DetectionBoxes_Precision/mAP@.50IOU = 0.33130914, DetectionBoxes_Precision/mAP@.75IOU = 0.03787096, DetectionBoxes_Recall/AR@1 = 0.18493989, DetectionBoxes_Recall/AR@10 = 0.36792925, DetectionBoxes_Recall/AR@100 = 0.48543888, DetectionBoxes_Recall/AR@100 (large) = 0.5131599, DetectionBoxes_Recall/AR@100 (medium) = 0.21598063, DetectionBoxes_Recall/AR@100 (small) = -1.0, Loss/classification_loss = 0.8041124, Loss/localization_loss = 0.35313264, Loss/regularization_loss = 0.15211834, Loss/total_loss = 1.30936, global_step = 1109, learning_rate = 0.28119853, loss = 1.30936
To grab the total loss, for example, I use this regex:
INFO.*Loss\/total_loss = ([0-9\.]+)
That's it. CloudWatch automatically plots total_loss over time.
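If you want to check a pattern locally before launching a job, here is a minimal sketch using Python's `re` module. It only shows what the regex captures from an abbreviated copy of the log line above; it assumes the training logs look exactly like that line, and it does not reproduce how SageMaker itself scans the logs.

```python
import re

# Abbreviated copy of the log line shown above (the COCO metrics are left out).
log_line = (
    "INFO:tensorflow:Saving dict for global step 1109: "
    "Loss/classification_loss = 0.8041124, "
    "Loss/localization_loss = 0.35313264, "
    "Loss/regularization_loss = 0.15211834, "
    "Loss/total_loss = 1.30936, global_step = 1109, "
    "learning_rate = 0.28119853, loss = 1.30936"
)

# Same pattern as above; the single capture group is the metric value.
pattern = re.compile(r"INFO.*Loss\/total_loss = ([0-9\.]+)")

match = pattern.search(log_line)
total_loss = float(match.group(1)) if match else None
print(total_loss)
```

Note that `[0-9\.]+` does not accept a leading minus sign, so a value such as the `-1.0` reported for mAP (small) would not be captured by this kind of pattern; something like `(-?[0-9\.]+)` would be needed for metrics that can go negative.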
You can define metrics either in the console or in the notebook, like this (just an example from my code):
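from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

# Raw strings keep the backslash in "\." from being read as a Python escape.
metrics = [{'Name': 'Loss', 'Regex': r'loss: ([0-9\.]+)'},
           {'Name': 'Accuracy', 'Regex': r'acc: ([0-9\.]+)'},
           {'Name': 'Epoch', 'Regex': r'Epoch ([0-9\.]+)'},
           {'Name': 'Validation_Acc', 'Regex': r'val_acc: ([0-9\.]+)'},
           {'Name': 'Validation_Loss', 'Regex': r'val_loss: ([0-9\.]+)'}]

# s3_output_location and hyperparameters are defined elsewhere in my notebook.
# These are SageMaker Python SDK v1 argument names; v2 renames the train_*
# arguments to instance_count, instance_type and max_run.
tf_estimator = TensorFlow(entry_point='training.py',
                          role=get_execution_role(),
                          train_instance_count=1,
                          train_instance_type='ml.p2.xlarge',
                          train_max_run=172800,
                          output_path=s3_output_location,
                          framework_version='1.12',
                          py_version='py3',
                          metric_definitions=metrics,
                          hyperparameters=hyperparameters)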
To test your regex before launching a job, you can use an online regex tester.
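You can also test the patterns locally with a few lines of Python. This is only a sketch: the `sample` line is a made-up Keras-style progress line, not real output from my job, and `extract` is a hypothetical helper. It shows what each pattern captures under Python's `re` module, which is not necessarily identical to how SageMaker applies the regex.

```python
import re

# Hypothetical Keras-style progress line, made up purely for illustration.
sample = "Epoch 3/10 - 5s - loss: 0.2134 - acc: 0.9371 - val_loss: 0.1875 - val_acc: 0.9452"

metric_definitions = [
    {"Name": "Loss", "Regex": r"loss: ([0-9\.]+)"},
    {"Name": "Accuracy", "Regex": r"acc: ([0-9\.]+)"},
    {"Name": "Epoch", "Regex": r"Epoch ([0-9\.]+)"},
    {"Name": "Validation_Acc", "Regex": r"val_acc: ([0-9\.]+)"},
    {"Name": "Validation_Loss", "Regex": r"val_loss: ([0-9\.]+)"},
]


def extract(definitions, line):
    """Return {metric name: every value the metric's regex captures in line}."""
    return {
        d["Name"]: [float(v) for v in re.findall(d["Regex"], line)]
        for d in definitions
    }


results = extract(metric_definitions, sample)
for name, values in results.items():
    print(name, values)

# "loss: " is also a substring of "val_loss: ", so the Loss pattern captures
# two values on this line. In Python's re, a negative lookbehind narrows it
# to the training loss only.
tight = [float(v) for v in re.findall(r"(?<!val_)loss: ([0-9\.]+)", sample)]
print("Loss (tight)", tight)
```

On this sample line the `Loss` and `Accuracy` patterns each capture two values, because `loss: ` and `acc: ` also occur inside `val_loss: ` and `val_acc: `. Whether that matters depends on how your logs are laid out, and I have not checked whether SageMaker accepts the lookbehind syntax, so treat the tighter pattern as a Python-only illustration.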