
All,

We recently had an issue where the ELB health check did not cover a certain scenario, which caused an application impact.

Can anyone suggest a fault-tolerant approach to handle this?

  1. We have a Node.js app running on port 80.
  2. We have 3 instances in the Target Group, which is registered with the ELB.
  3. The ELB health check was configured to hit the root path on port 80 and report success on an HTTP 200 response.
  4. Recently, one of the nodes had its application mount 100% full, while the root mount still had free space.
  5. Although the health check kept passing as far as the ELB was concerned, the server could not respond to any other requests and was effectively unhealthy. As a result, some requests succeeded while others (those routed to the disk-filled server) failed.
  6. We did receive notifications about the disk filling up from other monitoring systems, but they were missed due to the overwhelming volume of emails and limited resources.
  7. Is there any way we can improve the health check strategy so that these scenarios are reported to the Auto Scaling group or ELB, allowing the affected nodes to be removed and replaced automatically?

1 Answer


Rather than just checking that the root page returns a 200 response, you can configure Elastic Load Balancing to point to a custom health check page (e.g. healthcheck.php).
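
For instance, with the target group described in the question, the health check path can be switched from the root path to a dedicated endpoint. Here is a minimal sketch using the AWS SDK for JavaScript v3; the target group ARN, region, and the /healthcheck path are placeholders, not values from the question:

    // Minimal sketch: point the target group health check at a dedicated
    // /healthcheck path instead of the root path. The target group ARN and
    // region are placeholders -- substitute your own values.
    const {
      ElasticLoadBalancingV2Client,
      ModifyTargetGroupCommand,
    } = require('@aws-sdk/client-elastic-load-balancing-v2');

    const elbv2 = new ElasticLoadBalancingV2Client({ region: 'us-east-1' });

    async function pointHealthCheckAtDeepEndpoint() {
      await elbv2.send(new ModifyTargetGroupCommand({
        TargetGroupArn: 'arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc', // placeholder
        HealthCheckPath: '/healthcheck',
        HealthCheckPort: '80',
        Matcher: { HttpCode: '200' },  // only HTTP 200 counts as healthy
      }));
    }

    pointHealthCheckAtDeepEndpoint().catch(console.error);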

You could run some code on that page to test the general health of the application (database connectivity, disk space, free memory). If everything checks out OK, return a 200 response. If something is wrong, return a 500 response. This will cause the Load Balancer to treat the instance as Unhealthy, and it will stop routing traffic to that instance.
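
As a rough sketch of what such a page could look like for the Node.js app in the question, the handler below serves /healthcheck and fails it when the application mount is nearly full. It assumes Node.js 18.15+ (for fs.statfs), a hypothetical mount point /app-data, and a 5% free-space threshold; swap in whatever checks and thresholds fit your application:

    // Minimal sketch of a deeper /healthcheck endpoint (assumptions: Node.js
    // 18.15+ for fs.statfs, a hypothetical application mount at /app-data,
    // and a 5% free-space threshold -- adjust these to your environment).
    const http = require('node:http');
    const { statfs } = require('node:fs/promises');

    const APP_MOUNT = '/app-data';   // hypothetical application mount point
    const MIN_FREE_RATIO = 0.05;     // report unhealthy below 5% free space

    async function diskHealthy(mountPath) {
      const stats = await statfs(mountPath);
      // bavail = blocks available to unprivileged users, blocks = total blocks
      return stats.bavail / stats.blocks >= MIN_FREE_RATIO;
    }

    const server = http.createServer(async (req, res) => {
      if (req.url === '/healthcheck') {
        try {
          // Add further checks here: database connectivity, free memory, etc.
          const ok = await diskHealthy(APP_MOUNT);
          res.writeHead(ok ? 200 : 500).end(ok ? 'OK' : 'UNHEALTHY: disk full');
        } catch (err) {
          res.writeHead(500).end('UNHEALTHY: ' + err.message);
        }
        return;
      }
      // ...normal application routes would be handled here...
      res.writeHead(200).end('application response');
    });

    server.listen(80);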

If Auto Scaling is configured to use the ELB Health Check, then Auto Scaling will terminate the unhealthy instance and automatically replace it with a new instance.
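
Switching the Auto Scaling group to the ELB health check type can be done from the console, the CLI, or the SDK. A minimal sketch with the AWS SDK for JavaScript v3, using a placeholder group name and an assumed 300-second grace period:

    // Minimal sketch: switch the Auto Scaling group's health check type from
    // the default EC2 status checks to the ELB health check. The group name
    // is a placeholder; the 300-second grace period is an assumption.
    const {
      AutoScalingClient,
      UpdateAutoScalingGroupCommand,
    } = require('@aws-sdk/client-auto-scaling');

    const autoscaling = new AutoScalingClient({ region: 'us-east-1' });

    async function useElbHealthCheck() {
      await autoscaling.send(new UpdateAutoScalingGroupCommand({
        AutoScalingGroupName: 'my-node-app-asg',  // placeholder group name
        HealthCheckType: 'ELB',
        HealthCheckGracePeriod: 300,              // seconds to wait after launch
      }));
    }

    useElbHealthCheck().catch(console.error);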