I created an event rule in CloudWatch for the SageMaker training job state change, to monitor my training jobs. I then use these events to trigger a Lambda function that sends messages to a Telegram group as a bot. This way I receive a message every time one of the training jobs changes its status. It works, but there is a problem with the events: they are fired multiple times with the exact same payload, so I receive tons of duplicate messages.
Since all the payloads are identical (except for the field LastModifiedTime), I cannot filter them in the Lambda. Unfortunately I don't have the AWS Developer plan, so I cannot get support from Amazon. Any ideas?
EDIT
There are no duplicate rules/events. I also noticed that enabling the SageMaker profiler (which is now enabled by default) causes the number of identical rule invocations to explode. All of them have the same payload except for LastModifiedTime, so I suspect there is a bug in AWS. One solution could be to implement some sort of data retention in the Lambda and check whether an invocation has already been processed, but I don't want to complicate something that should be very simple. I just launched a new training job and got this sequence (I only report the fields I parse):
Status: InProgress Secondary Status: Starting Status Message: Launching requested ML instances
Status: InProgress Secondary Status: Starting Status Message: Starting the training job
Status: InProgress Secondary Status: Starting Status Message: Starting the training job
Status: InProgress Secondary Status: Starting Status Message: Starting the training job
Status: InProgress Secondary Status: Starting Status Message: Preparing the instances for training
Status: InProgress Secondary Status: Downloading Status Message: Downloading input data
Status: InProgress Secondary Status: Training Status Message: Downloading the training image
Status: InProgress Secondary Status: Training Status Message: Training in-progres
Status: InProgress Secondary Status: Training Status Message: Training image download completed. Training in progress