We get this message (via email) several times a day:

ALARM: "elb-production-UnHealthHostCount" in US - N. Virginia

You are receiving this email because your Amazon CloudWatch Alarm "elb-production-UnHealthHostCount" in the US - N. Virginia region has entered the ALARM state, because "Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0)." at "Thursday 21 January, 2016 17:39:39 UTC".

View this alarm in the AWS Management Console: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#s=Alarms&alarm=elb-production-UnHealthHostCount

Alarm Details:
- Name: elb-production-UnHealthHostCount
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint (0.2) was greater than the threshold (0.0).
- Timestamp: Thursday 21 January, 2016 17:39:39 UTC
- AWS Account: 1234567890

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanThreshold 0.0 for 60 seconds.

Monitored Metric:
- MetricNamespace: AWS/ELB
- MetricName: UnHealthyHostCount
- Dimensions: [LoadBalancerName = production]
- Period: 60 seconds
- Statistic: Average
- Unit: not specified

State Change Actions:
- OK:
- ALARM: [arn:aws:sns:us-east-1:1234567890:DevOps]
- INSUFFICIENT_DATA:

However, our nginx log files show that AWS was able to reach each of our servers around the time the alarm fired. In other words, our EC2 instances returned 200 on every request to /healthcheck around Thursday 21 January, 2016 17:39:39 UTC.

AWS seems to check each of our instances every 30 seconds or so.

Has anyone experienced this issue? If so, what have you done about it?

The datapoint of 0.2 suggests that a host might have been unhealthy for a portion of the alarm period, or at least took a while to respond as healthy. Perhaps change the threshold to be >= 1 rather than > 0? - John Rotenstein
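To make the comment's reasoning concrete, here is a minimal sketch of how a datapoint of 0.2 could arise. The five sample values are hypothetical (the real samples behind the alarm are not known); they only illustrate that a single unhealthy reading inside a 60-second period is enough to push the Average above 0.

```python
from statistics import mean

# Hypothetical UnHealthyHostCount samples reported within one 60-second period.
# Only one of the five samples saw an unhealthy host.
samples = [0, 0, 1, 0, 0]

average = mean(samples)  # what the "Average" statistic would report
maximum = max(samples)   # what the "Maximum" statistic would report

# Evaluate the candidate alarm conditions against this one period.
fires_average_gt_0 = average > 0   # original alarm: Average > 0
fires_average_ge_1 = average >= 1  # suggested threshold, still on Average
fires_maximum_ge_1 = maximum >= 1  # same threshold, on Maximum

print(average, maximum)
print(fires_average_gt_0, fires_average_ge_1, fires_maximum_ge_1)
```

Under these assumed samples, `Average >= 1` stays quiet, while `Maximum >= 1` still fires on the single unhealthy reading, so the choice of statistic matters as much as the threshold.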

1 Answer

I've updated a few settings from ...

  • Whenever: UnHealthyHostCount > 0
  • Statistic: Average

... to ...

  • Whenever: UnHealthyHostCount >= 1
  • Statistic: Maximum
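For anyone who prefers to script this change rather than click through the console, the following is a rough sketch of the same settings expressed as arguments for CloudWatch's PutMetricAlarm call. The alarm name, dimension, and SNS topic are copied from the alarm email in the question; the helper name `updated_alarm_params` is made up for illustration, and the boto3 call is left as a comment because it needs credentials.

```python
def updated_alarm_params():
    """Build keyword arguments describing the updated alarm (illustrative helper)."""
    return {
        "AlarmName": "elb-production-UnHealthHostCount",
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": "production"}],
        "Statistic": "Maximum",                                 # was "Average"
        "Period": 60,                                           # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1.0,                                       # was 0.0
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # was GreaterThanThreshold
        "AlarmActions": ["arn:aws:sns:us-east-1:1234567890:DevOps"],
    }

params = updated_alarm_params()

# To apply (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```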

I will update this answer if my problem continues to occur.


UPDATE:

The problem continued to occur :/

I've updated one more setting on my current UnHealthyHostCount alarm ...

for 1 consecutive period(s)

... to ...

for 2 consecutive period(s)
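The effect of requiring consecutive periods can be sketched with a simplified model. This is not CloudWatch's actual evaluation logic (it ignores missing data, for example), and the datapoint series are invented; it only shows that an isolated one-period blip no longer triggers the alarm once two breaching periods in a row are required.

```python
def alarm_would_fire(datapoints, threshold=1, periods=2):
    """Return True if `periods` consecutive datapoints are >= threshold."""
    run = 0  # length of the current streak of breaching datapoints
    for value in datapoints:
        run = run + 1 if value >= threshold else 0
        if run >= periods:
            return True
    return False

# Hypothetical per-period Maximum values of UnHealthyHostCount.
isolated_blips = [0, 1, 0, 1, 0]
sustained_outage = [0, 1, 1, 0]

print(alarm_would_fire(isolated_blips, periods=1))    # fires on a single blip
print(alarm_would_fire(isolated_blips, periods=2))    # ignores isolated blips
print(alarm_would_fire(sustained_outage, periods=2))  # fires on two in a row
```

The trade-off is that a real outage is reported one period later than it would be with a single-period alarm.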

... and I've created a new alarm to track if multiple servers are down for a single period ...

[screenshot of the new alarm's settings]

I will update this answer if my problem continues to occur.