3
votes

I have a CloudWatch alarm that receives data from a canary. The canary attempts to visit a website: if the website is up and responding, the datapoint is 0; if the server returns some sort of error, the datapoint is 1. Pretty standard canary stuff, I hope. The canary runs every 30 minutes.

My CloudWatch alarm is configured as follows:

enter image description here

The expected behaviour is that if my canary cannot reach the website 3 times in a row, the alarm should go off.

Unfortunately, this is not what's happening. My alarm was triggered with the following canary data:

enter image description here

  1. Feb 8 @ 7:51 PM (MST)
  2. Feb 8 @ 8:22 PM (MST)
  3. Feb 8 @ 9:52 PM (MST)

How is it possible that these three datapoints would trigger my alarm?

My actual email was received as follows:

You are receiving this email because your Amazon CloudWatch Alarm "...." in the US West (Oregon) region has entered the ALARM state, because "Threshold Crossed: 3 out of the last 3 datapoints [1.0 (09/02/21 04:23:00), 1.0 (09/02/21 02:53:00), 1.0 (09/02/21 02:23:00)] were greater than or equal to the threshold (1.0) (minimum 3 datapoints for OK -> ALARM transition)." at "Tuesday 09 February, 2021 04:53:30 UTC".

I am even more confused because the times on these datapoints do not align. If I convert these times to MST, we have:

  1. Feb 8 @ 7:23 PM
  2. Feb 8 @ 7:53 PM
  3. Feb 8 @ 9:23 PM

The time range of the reported datapoints is a two-hour window, even though I have clearly specified my evaluation period as 1.5 hours.
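As a sanity check on the conversion above, here is a small Python sketch that parses the three timestamps from the email and computes the span between the first and last datapoint. It assumes the email timestamps are UTC in dd/mm/yy order and treats MST as a fixed UTC-7 offset; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MST is a fixed UTC-7 offset (no daylight saving in February).
MST = timezone(timedelta(hours=-7), "MST")

# Timestamps copied from the alarm email, assumed to be UTC in dd/mm/yy order.
raw = ["09/02/21 04:23:00", "09/02/21 02:53:00", "09/02/21 02:23:00"]

# Parse and sort the datapoints from oldest to newest.
utc_points = sorted(
    datetime.strptime(ts, "%d/%m/%y %H:%M:%S").replace(tzinfo=timezone.utc)
    for ts in raw
)

# Convert each datapoint to MST and measure the first-to-last span.
mst_points = [p.astimezone(MST) for p in utc_points]
span = utc_points[-1] - utc_points[0]

for p in mst_points:
    print(p.strftime("%b %d @ %I:%M %p %Z"))
print("span:", span)
```

The span comes out as two hours, which is longer than the 1.5 hours I expected three consecutive 30-minute periods to cover.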

If I view the "metrics" chart in CloudWatch for my alarm, it makes even less sense:

enter image description here

The points in this chart are shown as:

  1. Feb 9 @ 2:30 UTC
  2. Feb 9 @ 3:00 UTC
  3. Feb 9 @ 4:30 UTC

Which, again, appears to be a 2-hour evaluation period.

Help? I don't understand this.

How can I configure my alarm to fire if my canary cannot reach the website 3 times in a row (waiting 30 minutes in-between checks)?

2 Answers

0
votes

I have two suggestions:

  1. Every time a canary runs, one datapoint is sent to CloudWatch. So if you want the alarm to trigger on 3 failures within 30 minutes, the canary should run at a 10-minute interval. That gives 3 datapoints in 30 minutes, and all 3 must be failures for the alarm to trigger.

  2. For some reason the statistic was not working for me, so I used the count option instead. Maybe this will help.

My suggestion is to run the canary every 5 minutes, which gives 6 datapoints in 30 minutes, and to create the alarm for count=4.
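The arithmetic behind these numbers can be sketched in a few lines of Python. `datapoints_per_window` is a hypothetical helper, and it assumes each canary run emits exactly one datapoint.

```python
def datapoints_per_window(window_minutes: int, run_interval_minutes: int) -> int:
    """Number of canary runs (one datapoint each) that fit in the window."""
    return window_minutes // run_interval_minutes


# Compare the current 30-minute schedule with the two suggested intervals.
for interval in (30, 10, 5):
    count = datapoints_per_window(30, interval)
    print(f"every {interval} min -> {count} datapoint(s) per 30 min")
```

With a 30-minute schedule only one datapoint lands in a 30-minute window, while 10-minute and 5-minute schedules give 3 and 6 respectively.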

-1
votes

The way I read your config, your alarm expects to find 3 datapoints within a 30-minute window, but your metric is only updated every 30 minutes, so this condition will never be true.

You need to increase the period so that 3 or more datapoints are available to trigger the alarm.
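As a rough sketch of the sizing, the parameters below describe an alarm whose evaluation window spans three canary runs. `build_alarm_params` and `create_alarm` are hypothetical helpers, the alarm name, namespace, and metric name are placeholders, and the `Maximum` statistic is my assumption; the threshold and comparison operator are taken from the alarm email. The keys are the keyword arguments of boto3's `put_metric_alarm`.

```python
def build_alarm_params(alarm_name, namespace, metric_name,
                       run_interval_seconds=1800, failures_in_a_row=3):
    """Build put_metric_alarm arguments so each period holds one canary run."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,        # placeholder, use your metric's namespace
        "MetricName": metric_name,     # placeholder, use your metric's name
        "Statistic": "Maximum",        # assumption: any failure in a period counts
        "Period": run_interval_seconds,
        "EvaluationPeriods": failures_in_a_row,
        "DatapointsToAlarm": failures_in_a_row,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def create_alarm(params):
    """Send the parameters to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so building the parameters needs no AWS SDK
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder names, for illustration only.
params = build_alarm_params("website-down", "MyCanaries", "Failure")
window_seconds = params["Period"] * params["EvaluationPeriods"]
print(window_seconds / 3600, "hours")
```

With a 30-minute period and 3 evaluation periods, the window is 1.5 hours, so three consecutive runs fit inside it. This sketch only covers the window sizing; it says nothing about how periods without a datapoint are evaluated.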