We've got a fairly simple setup which is causing us major headaches:
- HTTP API Gateway with an S3 integration for our static HTML/JS and an `ANY /api/{proxy+}` route to a Fargate service/tasks reachable via Cloud Map
- ECS cluster with an "API service" running on Fargate and a container task exposing port 8080 via `awsvpc`. No autoscaling. Min healthy: 100%, max: 200%.
- Service discovery using an `SRV` DNS record with TTL 60 (see the lookup sketch after this list)
- The ECS service/tasks are completely bored/idling and always happy to accept requests while logging them.
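For reference, this is roughly how we check what the SRV record currently resolves to from a box inside the VPC (a minimal sketch; it assumes dnspython is installed, and `api.internal.local` is a placeholder for our actual Cloud Map service name):

```python
# Resolve the Cloud Map SRV record and the A records it points at,
# to see which task IPs/ports the record currently advertises.
import dns.resolver  # third-party: dnspython

SERVICE_DNS_NAME = "api.internal.local"  # placeholder for the real service name

srv_answer = dns.resolver.resolve(SERVICE_DNS_NAME, "SRV")
print(f"SRV TTL: {srv_answer.rrset.ttl}s")
for srv in srv_answer:
    print(f"target={srv.target} port={srv.port}")
    for a in dns.resolver.resolve(str(srv.target), "A"):
        print(f"  -> {a.address}")
```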
Problem:
We receive intermittent `HTTP 503 Service Unavailable` responses for some of our requests. A new deployment (with task redeployment) increases the error rate, but even after 10-15 minutes the 503s still occur intermittently.
In CloudWatch we see the failing 503 requests:
2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=
but it seems they never reach a live backend task.
We enabled VPC Flow Logs, and it looks like HTTP API Gateway still tries to route some requests to stopped tasks long after they are gone for good (far exceeding the 60 s TTL).
More puzzling: if we keep the system busy, the error rate drops to nearly zero; after a longer period of idling, the intermittent errors reappear.
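To narrow down whether the stale endpoints come from Cloud Map itself or from caching somewhere in front of it, we can compare what Cloud Map's `DiscoverInstances` returns against the tasks ECS reports as `RUNNING`. A rough boto3 sketch (all cluster/service/namespace names below are placeholders for ours):

```python
import boto3

CLUSTER = "api-cluster"        # placeholder ECS cluster name
SERVICE = "api-service"        # placeholder ECS service name
NAMESPACE = "internal.local"   # placeholder Cloud Map namespace
SD_SERVICE = "api"             # placeholder Cloud Map service name

ecs = boto3.client("ecs")
sd = boto3.client("servicediscovery")

# Private IPs of the tasks ECS currently considers RUNNING.
task_arns = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE,
                           desiredStatus="RUNNING")["taskArns"]
running_ips = set()
if task_arns:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail["name"] == "privateIPv4Address":
                    running_ips.add(detail["value"])

# Instances Cloud Map still hands out for the service.
instances = sd.discover_instances(NamespaceName=NAMESPACE,
                                  ServiceName=SD_SERVICE)["Instances"]
for inst in instances:
    attrs = inst.get("Attributes", {})
    ip = attrs.get("AWS_INSTANCE_IPV4")
    port = attrs.get("AWS_INSTANCE_PORT")
    state = "OK" if ip in running_ips else "STALE (no RUNNING task)"
    print(f"{inst['InstanceId']}: {ip}:{port} -> {state}")
```

Any STALE entries would mean the registry itself still lists the old tasks; if everything looks clean there, the problem is more likely on the resolver/caching side.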
Questions
- How can we fix this issue?
- Are there options to further pinpoint the root cause?