0
votes

I am trying to use TrainingJobAnalytics to plot the training and validation loss curves for a training job using XGBoost on SageMaker. The training job completes successfully and I can see the training and validation rmse values in the CloudWatch logs.

However, when I try to get them in my notebook using TrainingJobAnalytics, I only get the metrics for a single timestamp and not all of them.

My code is as below:

from sagemaker.analytics import TrainingJobAnalytics
metrics_dataframe = TrainingJobAnalytics(training_job_name=job_name).dataframe()

What's going wrong and how can I fix it?

1
Not an answer to this question, but you can try SageMaker Debugger, which gives you flexibility such as plotting metrics, custom visualizations, and alerting when problems are found in the running training code, all without changing your code. Take a look at the example here: github.com/awslabs/amazon-sagemaker-examples/blob/master/… - Vikas
Consider creating an issue here: github.com/aws/sagemaker-python-sdk/issues - Gili Nachum
Could you show your code for starting the XGBoost training job? - lauren

1 Answer

0
votes

I went down the rabbit hole with this one, but let me share my experience with monitoring training data on SageMaker "out-of-the-box".

TL;DR: Monitoring runs at a 1-minute resolution, so values logged less than one minute apart are not kept as separate data points. SageMaker Debugger is also explored as an alternative. A minimalistic SMD (SageMaker Debugger) scalar example is in the gist linked further down.

So, to begin with, the same issue has been mentioned a couple of times:

None of them, however, has received a good explanation of why this is happening. So I decided to read through Amazon's official documentation.

https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-retrieve-data-point-metrics/

If the metric is a high-resolution metric (pushed at a sub-1 minute interval), confirm that the data points to the metric are pushed with the --storage-resolution parameter set to 1. Without this configuration, CloudWatch doesn't store the sub-minute data points and aggregates them into one-minute data points. In these cases, data points for a sub-minute period aren't retrievable.
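For reference, here is a minimal sketch of what "storage resolution set to 1" looks like when you push a custom metric yourself with boto3. The namespace `MyTrainingJobs` and the helper names are made up for illustration. This only applies to metrics you publish on your own; it is not a setting on the SageMaker training job.

```python
from datetime import datetime, timezone


def build_high_resolution_datum(name, value, timestamp=None):
    """Build one MetricData entry for CloudWatch's PutMetricData call."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Timestamp": timestamp or datetime.now(timezone.utc),
        # 1 = high resolution (1-second granularity); 60 = standard, the default.
        "StorageResolution": 1,
    }


def publish(namespace, data):
    """Send the entries to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


datum = build_high_resolution_datum("validation:rmse", 0.42)
print(datum["StorageResolution"])  # 1
# publish("MyTrainingJobs", [datum])  # hypothetical namespace; uncomment to send
```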

https://aws.amazon.com/cloudwatch/faqs/

Q: What resolution can I get from a Custom Metric?

https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html#define-train-metrics

Amazon CloudWatch supports high-resolution custom metrics and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see GetMetricStatistics in the Amazon CloudWatch API Reference.

https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs

Metrics are available at a 1-minute frequency.
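To see why a short job can end up with a single timestamp, here is a toy simulation in plain Python. It is not CloudWatch's actual implementation, only an illustration of averaging values into one-minute buckets; the job length and the rmse values are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_per_minute(points):
    """Average (timestamp, value) pairs into one value per one-minute bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Truncate the timestamp to the start of its minute.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}


start = datetime(2021, 1, 1, 12, 0, 0)
# Hypothetical job: 50 boosting rounds, one rmse value logged every second.
points = [(start + timedelta(seconds=i), 1.0 / (i + 1)) for i in range(50)]

aggregated = aggregate_per_minute(points)
print(len(points), "logged values ->", len(aggregated), "stored data point(s)")
# 50 logged values -> 1 stored data point(s)
```

All 50 rounds finish within the same minute, so they collapse into one data point, which matches the "single timestamp" symptom from the question.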

So, basically, for my scenario Amazon CloudWatch wasn't the tool that fit my needs.
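Since the question says the per-round rmse values are visible in the CloudWatch logs, one possible workaround is to parse the log lines directly instead of going through the metrics. The sketch below assumes the lines have already been downloaded and that they look like `[0]#011train-rmse:8.82#011validation-rmse:8.85` (`#011` being an escaped tab); that format is an assumption, so adjust the patterns to whatever your log stream actually contains.

```python
import re

# Assumed line shape: "[3]#011train-rmse:0.41#011validation-rmse:0.47"
ROUND_RE = re.compile(r"^\[(\d+)\]")
METRIC_RE = re.compile(
    r"((?:train|validation)-\w+):([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def parse_xgboost_log(lines):
    """Return one dict per boosting round found in the log lines."""
    rows = []
    for line in lines:
        match = ROUND_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not start with "[round]"
        row = {"round": int(match.group(1))}
        for name, value in METRIC_RE.findall(line):
            row[name] = float(value)
        rows.append(row)
    return rows


sample_lines = [  # made-up log lines in the assumed format
    "[0]#011train-rmse:8.82#011validation-rmse:8.85",
    "INFO: some unrelated log line",
    "[1]#011train-rmse:6.31#011validation-rmse:6.40",
]
rows = parse_xgboost_log(sample_lines)
print(rows[1])
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` for plotting.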

I decided to explore SageMaker Debugger, and oh man, was that hard. In theory, it also works out of the box. And it probably does, but not in a trivial "call a logger" way. You need to:

  • configure it correctly first (decide what you need to monitor)
  • use the preexisting conventions for the most popular libraries
  • hook it up to your model/pipeline

On top of that, a lot of functionality happens "behind the scenes", and it feels as if it was made specifically for the two scenarios that are always presented in educational videos about SageMaker Debugger.

I must admit though, it is quite powerful if you are an Amazon Engineer and know how to use it and when to use it.

Finally, I decided to write a simple local debugger that monitors a single value and then displays it. It took me around 8-10 hours, as I wasn't following their conventions (and the documentation never covered the "simplest example possible"). I'm providing it here as a gist:

https://gist.github.com/yoandinkov/d431ffef708599cb7f24a653305d1b8f
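To give an idea of what "monitor a single value and then display it" boils down to, here is a library-free sketch. It is not the code from the gist and does not use SageMaker Debugger (smdebug) at all; the class name, file name, and values are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


class ScalarRecorder:
    """Append (step, value) pairs for one named scalar to a JSON-lines file."""

    def __init__(self, name, path):
        self.name = name
        self.path = Path(path)

    def record(self, step, value):
        entry = {"name": self.name, "step": step, "value": value}
        with self.path.open("a") as handle:
            handle.write(json.dumps(entry) + "\n")

    def read(self):
        with self.path.open() as handle:
            return [json.loads(line) for line in handle]


log_path = Path(tempfile.mkdtemp()) / "validation_rmse.jsonl"
recorder = ScalarRecorder("validation_rmse", log_path)
for step, value in enumerate([0.9, 0.7, 0.55]):  # made-up values
    recorder.record(step, value)

history = recorder.read()
print([row["value"] for row in history])  # [0.9, 0.7, 0.55]
```

Every recorded step is kept, so nothing gets aggregated away, at the cost of having to collect and plot the file yourself.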

This is based on the following references:

To finalize this "Alice in the (not so) Wonderland" experience: use W&B or TensorBoard. Otherwise, you'll need a substantial amount of time, and will have to climb a steep learning curve, to understand what is going on "out-of-the-box". It might be beneficial after a while though, I don't know. (I, personally, won't use it for the time being.)

And let's not forget the most important part - have fun while exploring the myriad of possibilities in this vast weird internet place.