
We want to build an ECS cluster with the following characteristics:

  1. It must run inside a VPC, so we need the awsvpc network mode.
  2. It must use GPU instances, so we can't use Fargate.
  3. It must provision instances dynamically, so we need a capacity provider.
  4. It will run tasks (batch jobs) that are triggered directly through the AWS ECS API. For this reason, we don't need a service, only a task definition.
  5. These tasks must have access to S3 (internet), so, according to the AWS documentation, the instances must be placed in a private subnet (a reference to docs).

We've already read this post on Stack Overflow, which says that we need a private subnet whose route table points to a NAT Gateway in a public subnet, and that this public subnet should route to an internet gateway. We already have this configuration. We also have an S3 VPC endpoint configured in the route table.

Below you can see the relevant parts of the cluster configuration in Terraform (for the sake of simplicity I've only included the relevant parts):


# Launch template
resource "aws_launch_template" "train-launch-template" {
  name_prefix   = "${var.project_name}-launch-template-${var.env}"
  image_id      = "ami-01f62a207c1d180d2"
  instance_type = "m5.large"
  key_name      = "XXXXXX"
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs-instance-profile.name
  }
  user_data = base64encode(data.template_file.user_data.rendered)

  network_interfaces {
    associate_public_ip_address = false
    security_groups = [aws_security_group.ecs_service.id]
  }
}


# Task definition
resource "aws_ecs_task_definition" "task" {
  family                   = "${var.project_name}-${var.env}-train-task"
  execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_train_task_role.arn
  requires_compatibilities = ["EC2"]
  cpu                      = var.ecs_cpu
  network_mode             = "awsvpc"
  memory                   = var.ecs_memory
  container_definitions    = data.template_file.app_definition.rendered

  tags = {
    Stage   = var.env_tag
    Project = var.project_name_tag
  }
}


# Cluster
resource "aws_ecs_cluster" "cluster" {
  name = "${var.project_name}-${var.env}-train-ecs-cluster"
  capacity_providers = [aws_ecs_capacity_provider.train-capacity-provider.name]
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.train-capacity-provider.name
  }
  tags = {
    Project = var.project_name_tag
    Stage   = var.env_tag
  }
}

We have also configured all the roles that the instances and the task need to access the required resources (S3, ECR, ECS).

The AMI is an ECS-optimized image (the latest version published in eu-west-1 at the time of writing).

In the launch template we've removed the public IP from the instances, following the explanation in this link.

We arrived at this configuration while trying to make this work, but time and again we've hit the same problem: when the task is triggered, the capacity provider launches an instance, but the task is never placed on the container instance and remains in the PROVISIONING status indefinitely.

With the same configuration but with the instances in a public subnet, the tasks are placed on the container instances but, as warned in the first link, they have no access to the internet.

We need some enlightenment or a trace to follow. Thank you in advance.

UPDATE: As requested, I've added the remaining part concerning autoscaling:

resource "aws_autoscaling_group" "train-autoscaling" {
  availability_zones = ["eu-west-1b"]
  desired_capacity   = 0
  max_size           = 10
  min_size           = 0
  protect_from_scale_in = true
  

  launch_template {
    id      = aws_launch_template.train-launch-template.id
    version = "$Latest"
  }

  tags = [
    {
      key = "Project",
      value = var.project_name_tag
      propagate_at_launch = true
    },
    {
      key = "Stage",
      value = var.env_tag
      propagate_at_launch = true
    }
  ]
}

resource "aws_ecs_capacity_provider" "train-capacity-provider" {
  name = "${var.project_name}-${var.env}-train-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.train-autoscaling.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      maximum_scaling_step_size = 1
      minimum_scaling_step_size = 1
    }
  }
}

data "template_file" "user_data" {
  template = file("${path.module}/user_data.sh")

  vars = {
    cluster_name = "${var.project_name}-${var.env}-train-ecs-cluster"
  }
}

Update 2 (AWS Console info):

Container instances running: [screenshot]

Container instance detail: [screenshot]

Pending task: [screenshot]

Pending task details: [screenshot]

Update 3:

After 30 minutes the task stops and this is the message shown (Task failed to start): [screenshot]

Update 4:

Logs from the container instance. ecs-agent.log:

level=info time=2020-08-28T11:09:21Z msg="Loading configuration" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Amazon ECS agent Version: 1.44.1, Commit: 1f05fbf0" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2020-08-28T11:09:21Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2020-08-28T11:09:21Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:21Z msg="Loading state!" module=state_manager.go
level=info time=2020-08-28T11:09:23Z msg="Registering Instance with ECS" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Remaining mem: 7680" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registered container instance with cluster!" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registration completed successfully. I am running as 'arn:aws:ecs:eu-west-1:XXXXXXXXXXXXXXXX:container-instance/foqum-read-dev-train-ecs-cluster/95559f936f8d44de9373595009fcd588' in cluster 'foqum-read-dev-train-ecs-cluster'" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Beginning Polling for updates" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Initializing stats engine" module=engine.go
level=info time=2020-08-28T11:09:23Z msg="Event stream DeregisterContainerInstance start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXXX-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXXXX%3Acontainer-instance%2FXXXXXXXX-cluster%2F95559fXXXXXXde9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:09:23Z msg="NO_PROXY set:XXX.254.169.XXXX,XXXX.254.XXX.2,/var/run/docker.sock" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-a-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&clusterArn=XXXXX-ecs-cluster&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%XXXXXX%3Acontainer-instance%2FXXXXX-ecs-cluster%2F9XXXXX6f8d44de9373595009fcd588&dockerVersion=DockerVersion%3A+19.03.6-ce&sendCredentials=true&seqNum=1" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Connected to TCS endpoint" module=handler.go
level=info time=2020-08-28T11:09:23Z msg="Connected to ACS endpoint" module=acs_handler.go
level=info time=2020-08-28T11:20:04Z msg="TCS Websocket connection closed for a valid reason" module=handler.go
level=info time=2020-08-28T11:20:04Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXecs-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXX3Acontainer-instance%2FZZZXXXXX-ecs-cluster%2F95XXX936f8d44de9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:20:04Z msg="Connected to TCS endpoint" module=handler.go

ecs-init.log

2020-08-28T11:09:19Z [INFO] pre-start
2020-08-28T11:09:20Z [INFO] start
2020-08-28T11:09:20Z [INFO] No existing agent container to remove.
2020-08-28T11:09:20Z [INFO] Starting Amazon Elastic Container Service Agent
What are you doing with the launch template? You seem to be missing an autoscaling group resource. - jordanm
@jordanm I've added the autoscaling part to the question. - ainsausti
Does the ECS UI show the cluster has members? On the pending task page, if you expand the container details, does it show any errors? Have you inspected the ECS logs on the cluster instances? - jordanm
@jordanm I've added screenshots of the container instances and task information shown in the console. Regarding the ECS logs in the cluster instances, pardon my ignorance, but where can I see these logs? Do I have to access the instances through ssh? - ainsausti
Yeah, you have to access the logs through SSH. I have worked with ECS a lot and nothing you have shown stands out as being wrong. I would recommend contacting AWS support. - jordanm

1 Answer


Finally!! Solved the mystery!

The problem wasn't in the cluster configuration. When calling run_task through the ECS API, you need to specify the subnet the task should run in.

Our code was setting this field to one of the public subnets. That is why the task was placed when we moved the container instances to the availability zone of that public subnet.

After changing this call in our code to pass the private subnet instead, the task is placed correctly and has access to the internet.
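
For anyone hitting the same thing, here is a minimal sketch of what such a call might look like with boto3. It is not our actual code: the helper names `build_run_task_args` and `run_training_task` are hypothetical, and the subnet and security group IDs in the usage note are placeholders. The sketch assumes that no launch type or capacity provider strategy is passed, so the cluster's default capacity provider strategy applies.

```python
from typing import Any, Dict, List


def build_run_task_args(
    cluster: str,
    task_definition: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for ECS RunTask for a task in awsvpc mode.

    The subnets listed here decide where the task's network interface is
    created, so they should be the private subnet(s) the container
    instances live in, not a public subnet.
    """
    if not subnet_ids:
        raise ValueError("an awsvpc task needs at least one subnet")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        # No launchType or capacityProviderStrategy is set on purpose:
        # the cluster's default capacity provider strategy is used.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnet_ids),
                "securityGroups": list(security_group_ids),
            }
        },
    }


def run_training_task(
    cluster: str,
    task_definition: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, Any]:
    """Hypothetical wrapper that actually calls ECS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ecs = boto3.client("ecs", region_name="eu-west-1")
    args = build_run_task_args(
        cluster, task_definition, subnet_ids, security_group_ids
    )
    return ecs.run_task(**args)
```

A call would then look something like `run_training_task("my-cluster", "my-task-family", ["subnet-PRIVATE"], ["sg-XXXX"])`, where every argument is a placeholder for your own values. The important part is that the subnet passed in `awsvpcConfiguration` is the private one (same availability zone as the container instances), since that is where the task's network interface ends up.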