I'm using an AWS Lambda function, triggered hourly by a CloudWatch rule, to create an EMR cluster that executes a job. Once the EMR cluster has finished its steps, it writes a result file to an S3 bucket. The key path encodes the date and the hour of the day:

/bucket/2017/04/28/00/result.txt
/bucket/2017/04/28/01/result.txt
..
/bucket/2017/04/28/23/result.txt

I want to set up an alert in case the EMR job fails, for whatever reason, to create the result.txt for the hour.

I have already set up alerts on the Lambda invocation count and on the Lambda error count, but I haven't found an appropriate alert to verify that the EMR job actually finishes correctly.

Note that the Lambda is triggered at 3 minutes past every hour and the whole run takes about 15 minutes to complete. Would a good solution be to create another Lambda that is triggered at 30 minutes past every hour and checks that the expected key is present in the bucket? If the key is missing, it would write logs to CloudWatch that I could monitor and use to create my alert.

What other way could I achieve this alerting?

1 Answer

S3 offers free metrics on object count per bucket, but these are only reported once a day, which is not often enough for your use case.

CloudWatch Alarm on S3 Request Metrics

For a cost, you can enable CloudWatch request metrics for S3, which are reported in 1-minute periods. You could then, for example, create alarms on the following S3 CloudWatch metrics:

  • PutRequests sum <= 0 over each hour
  • 4xxErrors sum >= 1 over 1 minute
  • 5xxErrors sum >= 1 over 1 minute

Because the HTTP status code alarms run on much shorter intervals (down to 1 minute), they give feedback much closer to the moment a failure occurs.
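The three alarms listed above could be defined with boto3 roughly as follows. This is a minimal sketch, not a definitive setup: the function name, the alarm names, and the choice of how to treat missing data are assumptions, and `filter_id` must match the ID of the request metrics configuration you enabled on the bucket.

```python
def s3_request_alarms(bucket, filter_id, alert_topic_arn):
    """Build put_metric_alarm keyword arguments for the three S3 alarms.

    Alarm names and TreatMissingData choices are illustrative assumptions.
    """
    common = {
        "Namespace": "AWS/S3",
        # S3 request metrics are published per bucket and per metrics filter.
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},
        ],
        "Statistic": "Sum",
        "EvaluationPeriods": 1,
        "AlarmActions": [alert_topic_arn],
    }
    no_puts = {
        **common,
        "AlarmName": f"{bucket}-no-put-requests",
        "MetricName": "PutRequests",
        "Period": 3600,  # one hour, in seconds
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        # Assumption: an hour with no data points means nothing was written.
        "TreatMissingData": "breaching",
    }
    error_alarms = [
        {
            **common,
            "AlarmName": f"{bucket}-{metric}",
            "MetricName": metric,
            "Period": 60,  # one minute, in seconds
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # Assumption: no data points means no errors.
            "TreatMissingData": "notBreaching",
        }
        for metric in ("4xxErrors", "5xxErrors")
    ]
    return [no_puts] + error_alarms
```

Each dictionary can then be passed to CloudWatch, for example with `boto3.client("cloudwatch").put_metric_alarm(**alarm)` for every `alarm` in the returned list.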

CloudWatch Alarm on Put Events

If you don't want to incur the cost of S3 request metrics, you could instead configure an S3 event notification that publishes a message to an SNS topic on every put. CloudWatch can then alert on the sum of messages published (or the lack thereof).
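As a rough sketch, the event notification could be set up with boto3's `put_bucket_notification_configuration`. The helper names and the `result.txt` suffix filter are assumptions for illustration, and `s3:ObjectCreated:*` is used rather than a put-only event so that multipart uploads are also caught.

```python
def result_notification_config(topic_arn, suffix="result.txt"):
    """Build a notification configuration that targets only result files."""
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                # Matches plain puts as well as completed multipart uploads.
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def enable_result_notifications(s3_client, bucket, topic_arn):
    """Apply the configuration; s3_client is expected to be boto3.client("s3")."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=result_notification_config(topic_arn),
    )
```

Note that this call replaces any notification configuration already on the bucket, and that the SNS topic's access policy has to allow S3 to publish to it.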

You could then create a CloudWatch alarm based on this topic failing to publish a message.

  • Dimensions: TopicName = YOURSNSTOPIC

  • Namespace: AWS/SNS

  • Metric Name: NumberOfMessagesPublished

  • Threshold: NumberOfMessagesPublished <= 0 for 60 minutes (4 periods)

  • Statistic: Sum

  • Period: 15 minutes

  • Treat missing data as: breaching

  • Actions: Send notification to another, separate SNS topic that sends you an email/sms, or otherwise publishes to some alerting service.
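The settings above map fairly directly onto the parameters of boto3's `put_metric_alarm`. A minimal sketch, in which the function name and the alarm name are made up for illustration:

```python
def sns_silence_alarm(topic_name, alert_topic_arn):
    """Build put_metric_alarm keyword arguments from the settings listed above."""
    return {
        "AlarmName": f"{topic_name}-no-messages-published",  # illustrative name
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfMessagesPublished",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 900,  # 15 minutes, in seconds
        "EvaluationPeriods": 4,  # 4 x 15 minutes = 60 minutes
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "TreatMissingData": "breaching",
        # The separate SNS topic that emails, texts, or pages you.
        "AlarmActions": [alert_topic_arn],
    }
```

The alarm would then be created with `boto3.client("cloudwatch").put_metric_alarm(**sns_silence_alarm("YOURSNSTOPIC", alert_topic_arn))`.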

Discussion

Note that both CloudWatch solutions have the caveat that they won't fire alerts exactly at 30 minutes past the hour, but they will capture your entire monitoring period.

You may get better results by tuning these base examples, for instance by adjusting the period or how CloudWatch treats missing data.

A Lambda that triggers at 30 minutes past the hour (via cron-style scheduling) and checks S3 request metrics or the SNS topic's "NumberOfMessagesPublished" metric, instead of relying on CloudWatch alarms, could also accomplish this. This may be the better alternative if firing at exactly 30 minutes past the hour is important, as a CloudWatch alarm's firing time will not be as precise.
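Such a checking Lambda might look roughly like the sketch below, which sums the SNS topic's `NumberOfMessagesPublished` over the past hour using CloudWatch's `get_metric_statistics`. The `RESULT_TOPIC_NAME` environment variable and the helper name are assumptions. Raising an exception on a miss is one possible design: it makes the failure show up in the Lambda error count, which can itself be alarmed on.

```python
import os
from datetime import datetime, timedelta, timezone


def messages_published(cloudwatch, topic_name, end, window=timedelta(hours=1)):
    """Sum NumberOfMessagesPublished for the topic over the window ending at `end`."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SNS",
        MetricName="NumberOfMessagesPublished",
        Dimensions=[{"Name": "TopicName", "Value": topic_name}],
        StartTime=end - window,
        EndTime=end,
        Period=int(window.total_seconds()),
        Statistics=["Sum"],
    )
    # No data points at all means nothing was published in the window.
    return sum(point["Sum"] for point in response["Datapoints"])


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    topic_name = os.environ["RESULT_TOPIC_NAME"]  # hypothetical variable name
    now = datetime.now(timezone.utc)
    count = messages_published(boto3.client("cloudwatch"), topic_name, now)
    if count < 1:
        # Failing the invocation surfaces the miss in the Lambda Errors metric.
        raise RuntimeError(f"No result published to {topic_name} in the last hour")
    return {"messages_published": count}
```

The function would be scheduled with a cron-style rule for 30 minutes past each hour, and its execution role would need permission to call `cloudwatch:GetMetricStatistics`.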
