
I've set up an AWS CloudWatch alarm with the following parameters:

ActionsEnabled: true
AlarmActions: "some SNS topic"
AlarmDescription: "Too many HTTP 5xx errors"
ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 1
Dimensions:
  - Name: ApiName
    Value: "some API"
EvaluationPeriods: 20
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 300
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore

The idea is to receive an email when there are too many HTTP 5xx errors. I believe the above gives me an alarm that evaluates data points over periods of 5 minutes (300 s). If 1 out of the last 20 data points reaches the threshold (a 10% error rate), I should receive an email.
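If I understand the M-out-of-N evaluation correctly, it can be sketched like this (a toy model of my reading of the docs, not CloudWatch's actual implementation; `alarm_state` and the sample values are mine):

```python
# Toy model of CloudWatch "M out of N" alarm evaluation, assuming the
# semantics described above. Not the real implementation.
def alarm_state(datapoints, threshold=0.1,
                datapoints_to_alarm=1, evaluation_periods=20):
    window = datapoints[-evaluation_periods:]  # last N data points
    # GreaterThanOrEqualToThreshold: a data point breaches at >= threshold
    breaching = sum(1 for d in window if d >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# One breaching 5-minute average out of the last 20 is enough:
print(alarm_state([0.0] * 19 + [0.12]))  # ALARM
print(alarm_state([0.05] * 20))          # OK
```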

This works: I receive the email. But even after the error rate drops back below the threshold, I keep receiving emails, roughly for the entire duration of the evaluation interval (1 h 40 min = 20 × 5 minutes). Moreover, the emails arrive every 5 minutes, which makes me think there must be a connection with my configuration.

This question implies that this shouldn't happen, which seems logical to me. In fact, I'd expect not to receive another email for at least 1 hour and 40 minutes (20 × 5 minutes), even if the threshold is breached again.

This is the graph of my metric/alarm: AWS CloudWatch Alarm

Correction: I actually received 22 emails.

Have I made an error in my configuration?

Update: I can see that the state changed from Alarm back to OK 3 minutes after it changed from OK to Alarm:

Alarm state changes


1 Answer


This is what we've found and how we fixed it.

So we're evaluating blocks of 5 minutes and taking the average of the 5XXError metric, which amounts to the error rate over that block. But AWS evaluates at shorter intervals than 5 minutes. The distribution of the errors can be such that, at one point in time, a 5-minute block has an average of 12%, while a bit later the same errors end up split across two blocks with different averages, each possibly below the threshold.

That's what we believe is going on.
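The effect can be sketched with made-up numbers (the request counts below are purely illustrative assumptions):

```python
# 5XXError is 0 or 1 per request, so the Average statistic is the error
# rate over the window. All numbers below are made up for illustration.
def error_rate(requests):
    return sum(requests) / len(requests)

# 12 errors among 100 requests in one 5-minute block: 12% -> breaching
one_block = [1] * 12 + [0] * 88
print(error_rate(one_block))   # 0.12

# If the block boundary shifts, the same 12 errors can end up split
# across two blocks, each mixed with other, error-free requests:
block_a = [1] * 6 + [0] * 94   # 6 errors in 100 requests
block_b = [1] * 6 + [0] * 94
print(error_rate(block_a))     # 0.06 -> below the 0.1 threshold
print(error_rate(block_b))     # 0.06 -> below the 0.1 threshold
```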

We fixed it by changing the Period to 60 seconds and adjusting our DatapointsToAlarm and EvaluationPeriods settings accordingly.
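For example, the reworked settings could look like this (the DatapointsToAlarm and EvaluationPeriods values below are illustrative assumptions, not the exact values we used; pick them to match how quickly you want the alarm to fire and recover):

```yaml
Period: 60               # evaluate 1-minute data points instead of 5-minute ones
Statistic: Average
Threshold: 0.1
DatapointsToAlarm: 3     # e.g. 3 breaching minutes...
EvaluationPeriods: 5     # ...out of the last 5
```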