We've got a fairly simple setup which is causing us major headaches:
- HTTP API Gateway with an S3 integration for our static HTML/JS and an `ANY /api/{proxy+}` route to a Fargate service/tasks reachable via Cloud Map
- ECS cluster with an "API service" running on Fargate and a container task exposing port 8080 via `awsvpc`. No autoscaling. Min healthy: 100%, max: 200%.
- Service discovery using an `SRV` DNS record with TTL 60 (see the lookup sketch after this list)
- The ECS service/tasks are completely bored/idling and always happy to accept requests while logging them.
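For reference, this is roughly how we check what the SRV record currently resolves to from a box inside the VPC (a minimal sketch; it assumes dnspython is installed, and `api.internal.local` is a placeholder for our actual Cloud Map service name):

```python
# Resolve the Cloud Map SRV record and the A records it points at,
# to see which task IPs/ports the record currently advertises.
import dns.resolver  # third-party: dnspython

SERVICE_DNS_NAME = "api.internal.local"  # placeholder for the real service name

srv_answer = dns.resolver.resolve(SERVICE_DNS_NAME, "SRV")
print(f"SRV TTL: {srv_answer.rrset.ttl}s")
for srv in srv_answer:
    print(f"target={srv.target} port={srv.port}")
    for a in dns.resolver.resolve(str(srv.target), "A"):
        print(f"  -> {a.address}")
```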
Problem:
We receive intermittent `HTTP 503 Service Unavailable` responses for some of our requests. A new deployment (with task redeployment) increases the error rate, but even after 10-15 minutes the 503s still occur intermittently.
In CloudWatch we see the failing 503 requests:
2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=
but it seems they never reach a live backend task.
We enabled VPC Flow Logs, and it looks like HTTP API Gateway still tries to route some requests to stopped tasks long after they are gone for good (far exceeding the 60 s TTL).
More puzzling: if we keep the system busy, the error rate drops to nearly zero; after a longer period of idling, the intermittent errors reappear.
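To narrow down whether the stale endpoints come from Cloud Map itself or from caching somewhere in front of it, we can compare what Cloud Map's `DiscoverInstances` returns against the tasks ECS reports as `RUNNING`. A rough boto3 sketch (all cluster/service/namespace names below are placeholders for ours):

```python
import boto3

CLUSTER = "api-cluster"        # placeholder ECS cluster name
SERVICE = "api-service"        # placeholder ECS service name
NAMESPACE = "internal.local"   # placeholder Cloud Map namespace
SD_SERVICE = "api"             # placeholder Cloud Map service name

ecs = boto3.client("ecs")
sd = boto3.client("servicediscovery")

# Private IPs of the tasks ECS currently considers RUNNING.
task_arns = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE,
                           desiredStatus="RUNNING")["taskArns"]
running_ips = set()
if task_arns:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail["name"] == "privateIPv4Address":
                    running_ips.add(detail["value"])

# Instances Cloud Map still hands out for the service.
instances = sd.discover_instances(NamespaceName=NAMESPACE,
                                  ServiceName=SD_SERVICE)["Instances"]
for inst in instances:
    attrs = inst.get("Attributes", {})
    ip = attrs.get("AWS_INSTANCE_IPV4")
    port = attrs.get("AWS_INSTANCE_PORT")
    state = "OK" if ip in running_ips else "STALE (no RUNNING task)"
    print(f"{inst['InstanceId']}: {ip}:{port} -> {state}")
```

Any STALE entries would mean the registry itself still lists the old tasks; if everything looks clean there, the problem is more likely on the resolver/caching side.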
Questions
- How can we fix this issue?
- Are there options to further pinpoint the root cause?