All,
We recently had an issue where our ELB health check failed to catch a failure scenario, which caused an application impact.
Can anyone suggest a fault-tolerant approach to handle this?
- We have a Node.js app running on port 80.
- We have 3 instances in the Target Group, which is registered with the ELB.
- The ELB health check was configured to hit the root path on port 80 and report success if it gets an HTTP 200.
- Recently, the application mount on one of the nodes filled to 100%, while the root mount still had free space.
- Although the health check kept succeeding as far as the ELB was concerned, the server didn't respond to any other requests and was effectively unhealthy. As a result, some requests succeeded while others (those routed to the disk-full server) failed.
- We did receive notifications about the disk filling up from other monitoring systems, but due to the overwhelming volume of emails and limited resources they were missed.
- Is there any way we can improve the health check strategy so that these scenarios are reported to the Auto Scaling Group or ELB, and the affected nodes are removed and replaced automatically?
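
To make the idea concrete, one possible direction is a "deep" health check: a dedicated endpoint that returns a non-200 status when the application mount is nearly full, rather than the root path, which only shows that the process is listening. The following is a minimal sketch under stated assumptions, not a tested solution. The `/healthz` path, the `APP_MOUNT` environment variable, the 5% threshold, and the helper names are all hypothetical. It also assumes a Node.js version that provides `fs.promises.statfs` (v18.15 or later).

```javascript
const http = require("node:http");
const fs = require("node:fs/promises");

// Hypothetical settings: point APP_MOUNT at the real application mount.
const APP_MOUNT = process.env.APP_MOUNT || "/";
const MIN_FREE_RATIO = 0.05; // assumed threshold: at least 5% of blocks free

// Pure helper: true when the share of blocks available to unprivileged
// users (bavail / blocks) is at or above the threshold.
function hasEnoughSpace(stats, minFreeRatio) {
  if (!stats || !stats.blocks) return false;
  return stats.bavail / stats.blocks >= minFreeRatio;
}

// Resolves to the HTTP status the health endpoint should return.
// Any error while reading filesystem stats is treated as unhealthy.
async function healthStatus(mount = APP_MOUNT) {
  try {
    const stats = await fs.statfs(mount);
    return hasEnoughSpace(stats, MIN_FREE_RATIO) ? 200 : 503;
  } catch (err) {
    return 503;
  }
}

// Builds a server that answers only the (hypothetical) /healthz path.
// In a real app this handler would be a route in the existing server.
function createHealthServer() {
  return http.createServer(async (req, res) => {
    if (req.url === "/healthz") {
      res.writeHead(await healthStatus());
      res.end();
      return;
    }
    res.writeHead(404);
    res.end();
  });
}

// Not called here; e.g. startServer(80) to listen on the app's port.
function startServer(port) {
  return createHealthServer().listen(port);
}
```

For this to have any effect, the ELB health check path would have to point at the new endpoint instead of the root path. For the node to be replaced rather than just taken out of rotation, the Auto Scaling Group would also need ELB health checks enabled in addition to the default EC2 status checks.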