8
votes

We've got quite a simple setup that is causing us major headaches:

  1. HTTP API Gateway with an S3 integration for our static HTML/JS and an ANY /api/{proxy+} route to a Fargate service/tasks reachable via Cloud Map
  2. ECS cluster with an "API service" using Fargate and a container task exposing port 8080 via awsvpc. No autoscaling. Min healthy: 100%, max: 200%.
  3. Service discovery using an SRV DNS record with TTL 60 (a rough sketch of this wiring follows the list)
  4. The ECS service/tasks are completely bored/idling and always happy to accept requests while logging them.
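For illustration, here is a minimal boto3 sketch of how the service and its Cloud Map registration fit together; the cluster, task definition, subnet, security group and registry ARN are placeholders, not our real values:

    import boto3

    ecs = boto3.client("ecs")

    # Hypothetical names/ARNs for illustration only.
    ecs.create_service(
        cluster="my-cluster",
        serviceName="api-service",
        taskDefinition="api-task:1",
        desiredCount=2,
        launchType="FARGATE",
        deploymentConfiguration={"minimumHealthyPercent": 100, "maximumPercent": 200},
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "securityGroups": ["sg-xxxxxxxx"],
                "assignPublicIp": "DISABLED",
            }
        },
        # Cloud Map service discovery: SRV record resolving to task IP + port 8080
        serviceRegistries=[{
            "registryArn": "arn:aws:servicediscovery:eu-central-1:123456789012:service/srv-example",
            "port": 8080,
        }],
    )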

Problem:

We receive intermittent HTTP 503 Service Unavailable for some of our requests. A new deployment (with task redeployment) increases the rate, but even after 10-15 minutes they still occur intermittently.

In CloudWatch we see the failing 503 requests:

2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=

but it seems they never reach a live backend task.

We enabled VPC Flow Logs, and it looks like the HTTP API Gateway still tries to route some requests to stopped tasks long after they are gone for good (far exceeding the 60 s TTL).
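One way we could pinpoint it further is to compare what Cloud Map still advertises against the tasks that are actually running; a rough boto3 sketch, with namespace, service and cluster names as placeholders:

    import boto3

    sd = boto3.client("servicediscovery")
    ecs = boto3.client("ecs")

    # IPs that Cloud Map currently hands out (placeholder namespace/service names)
    resp = sd.discover_instances(NamespaceName="my-namespace", ServiceName="api-service")
    registered = {i["Attributes"].get("AWS_INSTANCE_IPV4") for i in resp["Instances"]}

    # Private IPs of tasks that are actually RUNNING
    arns = ecs.list_tasks(cluster="my-cluster", serviceName="api-service",
                          desiredStatus="RUNNING")["taskArns"]
    running = set()
    if arns:
        for task in ecs.describe_tasks(cluster="my-cluster", tasks=arns)["tasks"]:
            for container in task.get("containers", []):
                for eni in container.get("networkInterfaces", []):
                    running.add(eni.get("privateIpv4Address"))

    # Anything registered but no longer running is a stale entry the gateway may still hit
    print("stale registrations:", registered - running)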

More puzzling: if we keep the system busy, the error rate drops to nearly zero. After a longer idle period, however, the intermittent errors reappear.

Questions

  1. How can we fix this issue?
  2. Are there options to further pinpoint the root issue?

3 Answers

2
votes

I was facing this issue and solved it by configuring my ALB as internal instead of internet-facing (the scheme setting). Hope it helps someone with the same issue.

Context: the environment is API Gateway + ALB (ECS)

Update: The first ALB I configured was for my backend services. Recently I created another ALB (for my front-end instances), and in that case I exposed a public IP instead of just a private one by changing the scheme to internet-facing. At first I thought this would bring back the same problem I had before, but it turned out to be something pretty simple: I just needed to add a policy allowing traffic from the internet to the new ALB.
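For reference, the scheme is chosen when the load balancer is created; a minimal boto3 sketch, with the name, subnet and security group IDs as placeholders:

    import boto3

    elbv2 = boto3.client("elbv2")

    # Internal ALB: only resolvable/reachable from inside the VPC
    elbv2.create_load_balancer(
        Name="backend-alb",                               # placeholder name
        Type="application",
        Scheme="internal",                                # instead of "internet-facing"
        Subnets=["subnet-aaaaaaaa", "subnet-bbbbbbbb"],   # placeholder subnets
        SecurityGroups=["sg-xxxxxxxx"],                   # placeholder security group
    )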

1
votes

Though we were never able to really pinpoint the issue, we've come to the conclusion that it was a combination of

  • temporary internal AWS issues causing long delays for HTTP API Gateway to pick up Route 53 zone updates (used for service discovery), and
  • the absence of an Elastic Load Balancer (ELB)

Replacing the API Gateway with CloudFront and introducing an AWS Application Load Balancer changed the service discovery method: instead of a Route 53 zone, the load balancer tracks the available ECS/Fargate tasks on its own. This resolved the issue for us, along with a few other minor ones.
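Concretely, the ECS service then attaches to an ALB target group instead of a Cloud Map registry; a hedged boto3 sketch of the changed piece, with all names and ARNs as placeholders:

    import boto3

    ecs = boto3.client("ecs")

    # The serviceRegistries block is replaced by loadBalancers: the ALB health-checks
    # its targets itself and deregisters stopped tasks without relying on DNS TTLs.
    ecs.create_service(
        cluster="my-cluster",
        serviceName="api-service",
        taskDefinition="api-task:1",
        desiredCount=2,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "securityGroups": ["sg-xxxxxxxx"],
            }
        },
        loadBalancers=[{
            "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/api/abc123",  # placeholder
            "containerName": "api",
            "containerPort": 8080,
        }],
    )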

1
votes

What worked for me was, in addition to configuring my ALB's scheme as internal as xaalves did, also putting the ALB in an isolated or private subnet. Previously my ALB was in public subnets. bentolor's experience got me thinking that some sort of DNS resolution was going haywire, and sure enough that appeared to be the case. Now 100% of my HTTP calls complete successfully.
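If you want to verify where your ALB actually sits, here is a rough boto3 check; the load balancer name is a placeholder, and subnets that only use the VPC's main route table are not covered by this quick pass:

    import boto3

    elbv2 = boto3.client("elbv2")
    ec2 = boto3.client("ec2")

    # Subnets the ALB is attached to (placeholder name)
    lb = elbv2.describe_load_balancers(Names=["backend-alb"])["LoadBalancers"][0]
    subnet_ids = [az["SubnetId"] for az in lb["AvailabilityZones"]]

    for subnet_id in subnet_ids:
        # Route tables explicitly associated with this subnet
        tables = ec2.describe_route_tables(
            Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
        )["RouteTables"]
        has_igw_route = any(
            route.get("GatewayId", "").startswith("igw-")
            for table in tables
            for route in table.get("Routes", [])
        )
        print(subnet_id, "public (routes to an IGW)" if has_igw_route else "private/isolated")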