I went down the rabbit hole with this one, but let me share my experience with monitoring training data on SageMaker "out-of-the-box".
TL;DR: Monitoring runs at a 1-minute resolution, so anything logged at
intervals shorter than one minute is omitted. SageMaker Debugger is also
explored as an alternative. A minimal SMD scalar example is provided as a gist.
So, to begin with, the same issue has been mentioned a couple of times:
None of them, however, has received a good explanation of why this is happening. So I decided to read through Amazon's official documentation.
https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-retrieve-data-point-metrics/
If the metric is a high-resolution metric (pushed at a sub-1 minute
interval), confirm that the data points to the metric are pushed with
the --storage-resolution parameter set to 1. Without this
configuration, CloudWatch doesn't store the sub-minute data points and
aggregates them into one-minute data points. In these cases, data
points for a sub-minute period aren't retrievable.
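For reference, here is a minimal sketch of what pushing a high-resolution data point could look like from Python with boto3, assuming you publish the metric yourself. The namespace `MyTraining` and the metric name `train_loss` are made up for illustration; `StorageResolution` is the boto3 counterpart of the CLI's `--storage-resolution`.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, high_resolution=True):
    """Build one entry for the MetricData list of put_metric_data."""
    return {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": "None",
        # 1 = high resolution (sub-minute points are kept),
        # 60 = standard resolution (the default)
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, datum):
    """Send the datum to CloudWatch. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=[datum])


if __name__ == "__main__":
    datum = build_metric_datum("train_loss", 0.42)  # hypothetical metric
    print(datum["StorageResolution"])
    # publish("MyTraining", datum)  # uncomment once credentials are configured
```

I split building the payload from sending it so the first part can be checked locally without touching AWS.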
https://aws.amazon.com/cloudwatch/faqs/
Q: What resolution can I get from a Custom Metric?
https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html#define-train-metrics
Amazon CloudWatch supports high-resolution custom metrics and its
finest resolution is 1 second. However, the finer the resolution, the
shorter the lifespan of the CloudWatch metrics. For the 1-second
frequency resolution, the CloudWatch metrics are available for 3
hours. For more information about the resolution and the lifespan of
the CloudWatch metrics, see GetMetricStatistics in the Amazon
CloudWatch API Reference.
https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs
Metrics are available at a 1-minute frequency.
So, basically, for my scenario Amazon CloudWatch wasn't the tool that fit my needs.
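To make the effect concrete, here is a toy model of what a 1-minute resolution does to values logged more frequently. This is not CloudWatch's actual implementation; averaging per minute is my assumption for the sketch, and `aggregate_per_minute` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(points):
    """Collapse (timestamp_seconds, value) pairs into one average per minute."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // 60)].append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical loss logged every 10 seconds for 2 minutes.
    logged = [(t, 1.0 - t / 200) for t in range(0, 120, 10)]
    aggregated = aggregate_per_minute(logged)
    print(len(logged), "points logged")
    print(len(aggregated), "points left after 1-minute aggregation")
```

Twelve logged points collapse into two, which matches what I saw: the individual sub-minute values simply can't be retrieved afterwards.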
I decided to explore SageMaker Debugger, and oh man, was that hard. In theory, it also works out of the box. And it probably does, but not in a trivial "call a logger" way. You need to:
- configure it correctly first (what you need to monitor)
- use preexisting conventions for the most popular libraries
- hook it up to your model/pipeline
- deal with a lot of "behind-the-scenes" functionality
- accept that it feels like it was made specifically for the 2 scenarios that are always presented in educational videos about SageMaker Debugger.
I must admit though, it is quite powerful if you are an Amazon Engineer and know how to use it and when to use it.
Finally, I decided to write a simple local debugger, which monitors a single value and then displays it. It took me around 8-10 hours, as I wasn't following their conventions (and the documentation never covered the "simplest example possible"). I'm providing it here as a gist:
https://gist.github.com/yoandinkov/d431ffef708599cb7f24a653305d1b8f
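The gist above is the SageMaker Debugger version. Purely to illustrate the underlying idea ("record a single value per step, then display it"), here is a stdlib-only sketch. It does not use the SageMaker Debugger API at all, and `ScalarLog` with its `save`/`load` methods are hypothetical names I made up for this example.

```python
import json
import tempfile
from pathlib import Path


class ScalarLog:
    """Append (step, value) records for one named scalar to a JSON-lines file."""

    def __init__(self, out_dir, name):
        self.path = Path(out_dir) / f"{name}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def save(self, step, value):
        """Append one record; called from inside the training loop."""
        with self.path.open("a") as f:
            f.write(json.dumps({"step": step, "value": float(value)}) + "\n")

    def load(self):
        """Read back every record in the order it was written."""
        with self.path.open() as f:
            return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = ScalarLog(tmp, "loss")
        for step in range(5):
            log.save(step, 1.0 / (step + 1))  # stand-in for a real training loss
        for rec in log.load():
            print(f"step {rec['step']}: {rec['value']:.3f}")
```

Every step is kept regardless of how close together the values are logged, which is the part the 1-minute CloudWatch resolution couldn't give me.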
The gist is based on the following references:
To finalize this "Alice in the (not so) Wonderland" experience: use W&B or TensorBoard. Otherwise, you'll need a substantial amount of time, and you'll face a steep learning curve, to understand what is going on "out-of-the-box". It might be beneficial after a while though, I don't know. (Personally, I won't use it for the time being.)
And let's not forget the most important part - have fun while exploring the myriad of possibilities in this vast weird internet place.