Fargate deployment restarting multiple times before it comes online

Question

I have an ECS service deployed on Fargate.

It is attached to a Network Load Balancer (NLB). Rolling updates were working fine, but I have suddenly started seeing the issue below.

When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.

But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number varies; today it took 7 hours for the rolling update to finish).

There have been no changes to the infrastructure since the deployment. The security group allows traffic within the VPC. The NLB and the ECS service are deployed in the same VPC and the same subnet.

The health check fails N times for tasks running the same Docker image, and after that it starts passing.

The target group's healthy and unhealthy thresholds are both 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:

Started myapp in 44.174 seconds (JVM running for 45.734)
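To make the timing concrete, here is a rough sketch of how the settings above interact with the roughly 45 second startup. It assumes a deliberately simplified model that is not taken from the AWS documentation: the first check fires at t=0 when the task registers, checks fire exactly once per interval, and every check before the app is listening fails. The function names are hypothetical, and real NLB check timing can differ.

```python
import math


def failed_checks_during_startup(app_start_s: float, interval_s: int) -> int:
    """Count checks that fire before the app is listening (simplified model)."""
    # Checks fire at t = 0, interval, 2*interval, ...; those before app_start_s fail.
    return math.ceil(app_start_s / interval_s)


def seconds_until_healthy(app_start_s: float, interval_s: int, healthy_threshold: int) -> int:
    """Earliest time at which `healthy_threshold` consecutive checks have passed."""
    # First check that fires at or after the moment the app starts listening.
    first_passing_check = math.ceil(app_start_s / interval_s) * interval_s
    # The remaining (threshold - 1) passes each take one more interval.
    return first_passing_check + (healthy_threshold - 1) * interval_s


# Values from the question: JVM up after ~45.7 s, interval 30 s, threshold 3.
print(failed_checks_during_startup(45.734, 30))   # 2
print(seconds_until_healthy(45.734, 30, 3))       # 120
```

Under this model the startup alone produces 2 failed checks, one short of the unhealthy threshold of 3, and the task would need about 120 seconds to be reported healthy. This only illustrates the arithmetic; it does not explain the intermittent failures described above.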

Once a task came up, I opened a security group rule for the VPN and tried accessing the task IP directly. I can reach the microservice directly via the task IP.

So why is the NLB health check failing?
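A manual check like the one above can be scripted as a plain TCP connect, which is roughly what a TCP health check verifies (that the port accepts a connection). This is a minimal sketch using only the standard library; the function names, the IP address, and the port in the usage comment are placeholders, and the probe must be run from a host the security group allows.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout_s: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def probe_repeatedly(host: str, port: int, attempts: int, interval_s: float) -> list:
    """Run tcp_check several times, sleeping interval_s between attempts."""
    results = []
    for i in range(attempts):
        results.append(tcp_check(host, port))
        if i < attempts - 1:
            time.sleep(interval_s)
    return results


# Example usage (hypothetical task IP and port, not from the question):
# print(probe_repeatedly("10.0.0.12", 8080, attempts=3, interval_s=30))
```

Running the probe from inside the same subnet as the NLB, rather than over the VPN, would be closer to the path the health check takes.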

Do you have any more information on the health check failures? — Chris Williams
This might help: aws.amazon.com/premiumsupport/knowledge-center/…. — Chris Williams
What are your settings for the healthchecks and target group? — Marcin

Yossi Cohn · Accepted Answer · 2020-11-11T08:41:54

I had the exact same issue. I reproduced it with different images (Go, Python), since I suspected CPU/memory overhead, but that turned out not to be the cause.

A mitigation is to change the Fargate deployment parameter Minimum healthy percent to 50% (it was previously 100%, which seemed to cause the issue). After the change the failures became rare, but they still occurred. The real cause is still unknown; it seems to be related to the NLB configuration in Fargate.
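For reference, the same setting can be changed programmatically. The sketch below assumes boto3 is installed and credentials are configured; the helper names and the cluster and service names are placeholders. The maximum percent of 200 is an assumption here, since the answer only mentions the minimum healthy percent.

```python
def build_update_kwargs(cluster: str, service: str,
                        min_healthy: int = 50, max_percent: int = 200) -> dict:
    """Build the arguments for the ECS update_service call."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": {
            # Lower bound on running tasks during a deployment, as a
            # percentage of the desired count (50 per the answer above).
            "minimumHealthyPercent": min_healthy,
            # Upper bound on running tasks during a deployment (assumed value).
            "maximumPercent": max_percent,
        },
    }


def apply_update(kwargs: dict) -> dict:
    """Send the update to ECS. Requires boto3 and valid AWS credentials."""
    import boto3  # third-party; imported lazily so the builder works without it

    ecs = boto3.client("ecs")
    return ecs.update_service(**kwargs)


# Example usage (hypothetical names):
# apply_update(build_update_kwargs("my-cluster", "my-service"))
```

The same value is exposed in the console as "Minimum healthy percent" in the service's deployment configuration.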
