64
votes

I am trying to run a task from a private repository on the aws-ecs-fargate-1.4.0 platform.

For private repository authentication, I followed the docs and it was working well.

Somehow, after updating the existing service many times, the task now fails to run and reports an error like this:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to get registry auth from asm: service call has been retried 1 time(s): asm fetching secret from the service for <secretname>: RequestError: ...

I haven't changed the ecsTaskExecutionRole, and it contains all the policies required to fetch the secret value:

  1. AmazonECSTaskExecutionRolePolicy
  2. CloudWatchFullAccess
  3. AmazonECSTaskExecutionRolePolicy
  4. GetSecretValue
  5. GetSSMParamters
13
This is probably related to the security group of your ECS service. Make sure your inbound rules are correct (protocol, port, ...) and that the outbound rules allow all traffic out (I got the error above because my outbound rule was restricted to a specific port). - Shams Larbi

13 Answers

72
votes

AWS employee here.

What you are seeing is due to a change in how networking works between Fargate platform version 1.3.0, and Fargate platform version 1.4.0. As part of the change from using Docker to using containerd we also made some changes to how networking works. In version 1.3.0 and below each Fargate task got two network interfaces:

  • One network interface was used for the application traffic from your application container(s), as well as for logs and container image layer pulls.
  • A secondary network interface was used by the Fargate platform itself, to get ECR authentication credentials, and fetch secrets.

This secondary network interface had some downsides though. This secondary traffic did not show up in your VPC flow logs. Also while most traffic stayed in the customer VPC, the secondary network interface was sending traffic outside of your VPC. A number of customers complained that they did not have the ability to specify network level controls on this secondary network interface and what it was able to connect to.

To make the networking model less confusing and give customers more control, in Fargate platform version 1.4.0 we switched to a single network interface and keep all traffic inside your VPC, including the Fargate platform traffic. The Fargate platform traffic for fetching ECR authentication and task secrets now uses the same task network interface as the rest of your task traffic. You can observe this traffic in VPC flow logs and control it using the routing table in your own AWS VPC.

However, with this increased ability to observe and control the Fargate platform networking, you also become responsible for ensuring that there is actually a network path configured in your VPC that allows the task to communicate with ECR and AWS Secrets Manager.

There are a few ways to solve this:

  • Launch tasks into a public subnet, with a public IP address, so that they can communicate with ECR and other backing services through an internet gateway.
  • Launch tasks in a private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection to ECR on behalf of the task.
  • Launch tasks in a private subnet and make sure you have AWS PrivateLink endpoints configured in your VPC, for the services you need (ECR for image pull authentication, S3 for image layers, and AWS Secrets Manager for secrets).
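As a rough illustration of the three options above, here is a minimal Python sketch that decides which one (if any) gives a task a network path. The `TaskNetwork` class, its field names, and the short service labels are all hypothetical and exist only for this example; they are not AWS identifiers, and the check is a simplification of real VPC routing.

```python
from dataclasses import dataclass, field

# Services the platform must reach over the task ENI, per the list above.
# These short labels are illustrative only, not real AWS service names.
REQUIRED_SERVICES = frozenset({"ecr", "s3", "secretsmanager"})


@dataclass
class TaskNetwork:
    """Hypothetical summary of where a Fargate task is launched."""
    subnet_routes_to_igw: bool = False   # subnet is "public"
    assign_public_ip: bool = False       # task gets a public IP address
    subnet_routes_to_nat: bool = False   # private subnet behind a NAT gateway
    vpc_endpoints: frozenset = field(default_factory=frozenset)


def network_path(net: TaskNetwork) -> str:
    """Return the option that provides a path to ECR and secrets, or 'none'."""
    if net.subnet_routes_to_igw and net.assign_public_ip:
        return "internet-gateway"
    if net.subnet_routes_to_nat:
        return "nat-gateway"
    if REQUIRED_SERVICES <= net.vpc_endpoints:
        return "privatelink"
    return "none"
```

A public subnet without a public IP falls through to "none", which matches the failure mode described in this thread.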

You can read more about this change in this official blogpost, under the section "Task elastic network interface (ENI) now runs additional traffic flows"

https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/

14
votes

I'm not completely sure about your setup but after I disabled the NAT-Gateways to save some $, I had a very similar error message on the aws-ecs-fargate-1.4.0 platform:

Stopped reason: ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

It turned out that I had to create VPC Endpoints to these Service names:

  • com.amazonaws.REGION.s3
  • com.amazonaws.REGION.ecr.dkr
  • com.amazonaws.REGION.ecr.api
  • com.amazonaws.REGION.logs
  • com.amazonaws.REGION.ssm

And I had to downgrade to the aws-ecs-fargate-1.3.0 platform. After the downgrade the Docker images could be pulled from ECR and the deployments succeeded again.

If you are using Secrets Manager without a NAT gateway, you may also have to create a VPC endpoint for com.amazonaws.REGION.secretsmanager.
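To avoid typing the service names by hand, a small helper can expand them for a given region. This is only a sketch: the function name and the `use_secrets_manager` flag are made up for illustration, and the list simply mirrors the endpoints named in this answer rather than an authoritative set.

```python
from typing import List


def endpoint_service_names(region: str, use_secrets_manager: bool = False) -> List[str]:
    """Expand the VPC endpoint service names from this answer for one region."""
    # Suffixes taken from the list in this answer.
    services = ["s3", "ecr.dkr", "ecr.api", "logs", "ssm"]
    if use_secrets_manager:
        # Only needed when the task reads secrets from Secrets Manager.
        services.append("secretsmanager")
    return [f"com.amazonaws.{region}.{service}" for service in services]


if __name__ == "__main__":
    for name in endpoint_service_names("eu-west-1", use_secrets_manager=True):
        print(name)
```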

11
votes

This error occurs when the Fargate agent fails to create or bootstrap the resources required to start the container or the task it belongs to. It only occurs on platform version 1.4 or later, most likely because version 1.4 uses the task ENI (which is in your VPC) instead of the Fargate ENI (which is in AWS's VPC). I'd think this might be caused by extra IAM permissions being needed to pull the image from ECR. Are you using any PrivateLink endpoints? If yes, you might wanna take a look at the policies for the ECR endpoint.

I'll try to replicate it, but I'd suggest opening a support ticket with AWS if you can, so they can take a closer look at your resources and give better advice.

10
votes

Ensure the task has internet connectivity via either an IGW or a NAT gateway. If it is via an IGW, make sure the public IP is enabled in the Fargate task/service network configuration:

{
  "awsvpcConfiguration": {
    "subnets": ["string", ...],
    "securityGroups": ["string", ...],
    "assignPublicIp": "ENABLED"|"DISABLED"
  }
}
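For reference, a minimal Python sketch that builds a dictionary of the same shape as the structure above. The helper name and the `public_subnet` parameter are hypothetical. The result is the kind of value one would pass as `networkConfiguration` to the boto3 ECS client's `create_service` or `run_task`; no AWS call is made here, and the IDs in the usage lines are placeholders.

```python
from typing import Dict, List


def awsvpc_network_configuration(
    subnets: List[str],
    security_groups: List[str],
    public_subnet: bool,
) -> Dict[str, dict]:
    """Build an awsvpc network configuration dictionary.

    Assumption: tasks in a public subnet (routed through an IGW) need a
    public IP, while tasks behind a NAT gateway or VPC endpoints do not.
    """
    if not subnets:
        raise ValueError("at least one subnet is required")
    return {
        "awsvpcConfiguration": {
            "subnets": list(subnets),
            "securityGroups": list(security_groups),
            "assignPublicIp": "ENABLED" if public_subnet else "DISABLED",
        }
    }


if __name__ == "__main__":
    # Placeholder IDs, not real resources.
    print(awsvpc_network_configuration(["subnet-aaaa"], ["sg-bbbb"], public_subnet=True))
```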
6
votes

Since the ECS agent in Fargate platform version 1.4.0 uses the task ENI to retrieve information, the request to Secrets Manager goes through this ENI.

You must ensure that traffic to the Secrets Manager API (secretsmanager.{region}.amazonaws.com) is allowed:

  • If your task is private, you must have either a VPC endpoint (com.amazonaws.{region}.secretsmanager) or a NAT gateway, and the task ENI's security group must allow outbound HTTPS traffic to it.

  • If your task is public, the security group must allow outbound HTTPS traffic to the outside (or to the AWS public CIDRs).
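A quick way to sanity-check the security group part is to scan its egress rules for anything that covers TCP port 443. The sketch below assumes rules shaped like the `IpPermissionsEgress` entries returned by EC2's `describe_security_groups` (`IpProtocol`, `FromPort`, `ToPort`). The function name is made up, and the check deliberately ignores the destination CIDR, so treat a `True` result as necessary rather than sufficient.

```python
from typing import Iterable, Mapping


def allows_https_egress(egress_rules: Iterable[Mapping]) -> bool:
    """Return True if any egress rule covers TCP port 443.

    Destinations (CIDRs, prefix lists, security groups) are not inspected.
    """
    for rule in egress_rules:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            return True
        if protocol == "tcp":
            # A missing port bound defaults to -1 so the range check fails.
            if rule.get("FromPort", -1) <= 443 <= rule.get("ToPort", -1):
                return True
    return False
```

For example, a group whose only outbound rule is TCP 5432 (a database port) would return `False` here, which is consistent with the port-restricted outbound rule described in the comment on the question.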

6
votes

If you are using a public subnet and select "Don't assign public address", this error can happen.

The same is applicable if you have a private subnet and do not have an internet gateway or NAT gateway in your VPC. It needs a route to the internet.

This is the same behaviour across the whole AWS ecosystem. It would be great if AWS displayed a prominent warning in such cases.

3
votes

I was having the exact same issue using Fargate as the launch type with platform version 1.4.0. In the end, since I was using public subnets, all I needed to do was enable public IP assignment for the tasks so that they had outbound network access to pull the image.

I got the hint to solve it when I tried to create the service using platform version 1.3.0 and the task creation failed with a similar but, fortunately, documented error.

1
votes

I resolved a similar problem by updating the rules in the ECS service's security group. The rule configuration is below.

Inbound Rules:
* HTTP          TCP   80    0.0.0.0/0
Outbound Rules:
* All traffic   All   All   0.0.0.0/0

0
votes

If your Fargate task is running in a private subnet with no internet access, your VPC should already have an ecr.dkr VPC endpoint in place so that Fargate (version 1.3 and below) can reach that endpoint and spin up the container. For Fargate version 1.4, you additionally need the ecr.api endpoint.

https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/

0
votes

I just had this issue, and the reason was that I forgot to add inbound and outbound rules to the security group associated with my service. (I added inbound from my ALB and outbound *.)

0
votes

The service's security group needs outbound access on port 443 (outbound access on all ports will also work). Without this, the task can't access Secrets Manager.

0
votes

For me it was a combination of not having the SecretsManagerReadWrite policy attached to my IAM role (thanks Jinkko) AND not having a public IP enabled on the compute instance (to get to the ECR repo).

0
votes

This has burned me sufficiently well today that I figured I'd share my experience, since it differs from most of the above (the AWS employee's answer covers it technically, but doesn't spell the problem out).

If all the following are true:

  • You're running platform 1.4.0 (or newer, presumably; at the time of writing, 1.4.0 is the latest)
  • You're in a VPC environment
  • Your VPC, for "reasons", runs its own DNS (i.e. not at VPC_BASE+2)
  • For "reasons", you don't allow all outbound traffic, so you're setting egress rules on your task security group

And consequently, you have endpoints for all the things, then the following must also be true:

  • Your homegrown DNS will need to be able to correctly resolve the private addresses of the endpoints (for instance, by forwarding to VPC_BASE+2, but how doesn't matter)
  • You will also need to make sure your task security group has rules allowing DNS traffic to your DNS server(s) <-- This one burned me.

To add insult to injury, what little error information you get out of Fargate doesn't really indicate that you have a DNS issue, and naturally your CloudTrail logs won't show a damn thing either, since nothing ends up hitting the API to start with.
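One way to catch this kind of DNS problem early is to resolve a service hostname from inside the VPC and check that every answer is a private address. The sketch below uses only the Python standard library. The function names and the injectable `resolver` parameter are inventions for this example, and it assumes the interface endpoints have private DNS enabled, so that the regular service hostname is expected to resolve to private IPs.

```python
import ipaddress
import socket
from typing import Callable, List


def resolved_addresses(hostname: str, port: int = 443,
                       resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the distinct IPv4 addresses the hostname resolves to."""
    infos = resolver(hostname, port, family=socket.AF_INET,
                     proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname: str,
                       resolver: Callable = socket.getaddrinfo) -> bool:
    """True only if the name resolves and every address is private."""
    try:
        addresses = resolved_addresses(hostname, resolver=resolver)
    except socket.gaierror:
        # No answer at all: DNS is unreachable or the name is unknown.
        return False
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )


if __name__ == "__main__":
    # Placeholder region; run this from a host inside the VPC.
    print(resolves_privately("secretsmanager.eu-west-1.amazonaws.com"))
```

A `False` result from inside the VPC suggests either that the DNS servers are unreachable from the task's security group or that they are not returning the endpoint's private addresses.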