We want to build an ECS cluster with the following characteristics:
- It must run inside a VPC, so we need the awsvpc network mode.
- It must use GPU instances, so we can't use Fargate.
- It must provision instances dynamically, so we need a capacity provider.
- It will run tasks (batch jobs) that are triggered directly through the AWS ECS API, so we don't need a service, only a task definition.
- These tasks must have access to S3 (internet), so according to the AWS documentation the instances must be placed inside a private subnet (a reference to docs).
We've already read this post on Stack Overflow, which says that we need to set up a private subnet whose route table points to a NAT gateway in a public subnet, and that the public subnet's route table should point to an internet gateway. We already have this configuration. We also have an S3 VPC endpoint configured in the route table.
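As a minimal sketch, the network layout described above corresponds roughly to the Terraform below. The resource names, `var.vpc_id`, and the CIDR blocks are illustrative assumptions, not the actual configuration; the `vpc = true` argument on `aws_eip` matches older AWS provider versions (newer ones use `domain = "vpc"`).

```hcl
# Hypothetical input: the ID of the existing VPC.
variable "vpc_id" {
  type = string
}

resource "aws_internet_gateway" "igw" {
  vpc_id = var.vpc_id
}

# Public subnet: hosts the NAT gateway and routes to the internet gateway.
resource "aws_subnet" "public" {
  vpc_id                  = var.vpc_id
  cidr_block              = "10.0.0.0/24" # assumed CIDR
  availability_zone       = "eu-west-1b"
  map_public_ip_on_launch = true
}

# Private subnet: intended home of the ECS container instances.
resource "aws_subnet" "private" {
  vpc_id            = var.vpc_id
  cidr_block        = "10.0.1.0/24" # assumed CIDR
  availability_zone = "eu-west-1b"
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

resource "aws_route_table" "public" {
  vpc_id = var.vpc_id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

# Gateway endpoint so S3 traffic from the private subnet stays inside AWS.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```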
Below, you can see the cluster configuration in Terraform (for the sake of simplicity, I've only included the relevant parts):
# Launch template
resource "aws_launch_template" "train-launch-template" {
  name_prefix   = "${var.project_name}-launch-template-${var.env}"
  image_id      = "ami-01f62a207c1d180d2"
  instance_type = "m5.large"
  key_name      = "XXXXXX"

  iam_instance_profile {
    name = aws_iam_instance_profile.ecs-instance-profile.name
  }

  user_data = base64encode(data.template_file.user_data.rendered)

  # No public IP: the instances are meant to live in a private subnet
  network_interfaces {
    associate_public_ip_address = false
    security_groups             = [aws_security_group.ecs_service.id]
  }
}

# Task definition
resource "aws_ecs_task_definition" "task" {
  family                   = "${var.project_name}-${var.env}-train-task"
  execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_train_task_role.arn
  requires_compatibilities = ["EC2"]
  cpu                      = var.ecs_cpu
  network_mode             = "awsvpc"
  memory                   = var.ecs_memory
  container_definitions    = data.template_file.app_definition.rendered

  tags = {
    Stage   = var.env_tag
    Project = var.project_name_tag
  }
}

# Cluster
resource "aws_ecs_cluster" "cluster" {
  name               = "${var.project_name}-${var.env}-train-ecs-cluster"
  capacity_providers = [aws_ecs_capacity_provider.train-capacity-provider.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.train-capacity-provider.name
  }

  tags = {
    Project = var.project_name_tag
    Stage   = var.env_tag
  }
}
We have also configured all the roles needed for the instances and the task to access the required resources (S3, ECR, ECS).
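For context, an instance role for ECS container instances commonly looks something like the sketch below: an EC2-assumable role with the AWS managed policy `AmazonEC2ContainerServiceforEC2Role` attached, wrapped in an instance profile. All resource names here are hypothetical and this is not necessarily how the roles in this setup are defined.

```hcl
# Trust policy that lets EC2 instances assume the role.
data "aws_iam_policy_document" "ec2_assume_sketch" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_instance_role_sketch" {
  name               = "ecs-instance-role-sketch" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_sketch.json
}

# AWS managed policy that allows the ECS agent to register the instance and pull from ECR.
resource "aws_iam_role_policy_attachment" "ecs_instance_sketch" {
  role       = aws_iam_role.ecs_instance_role_sketch.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instance_profile_sketch" {
  name = "ecs-instance-profile-sketch" # hypothetical name
  role = aws_iam_role.ecs_instance_role_sketch.name
}
```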
The AMI is an ECS-optimized image (the latest version published at this moment in eu-west-1).
In the launch template we've removed the public IP from the instances, following the explanation in this link.
We've arrived at this configuration while trying to make this work, but time and again we've hit the same problem: when the task is triggered, the capacity provider launches an instance, but the task is never placed on the container instance and remains in the PROVISIONING status indefinitely.
With the same configuration but with the instances in a public subnet, the tasks are placed on the container instances but, as warned in the first link, the task has no internet access.
We need some guidance or a lead to follow. Thank you in advance.
UPDATE: As requested, I've added the remaining configuration related to autoscaling:
resource "aws_autoscaling_group" "train-autoscaling" {
  availability_zones    = ["eu-west-1b"]
  desired_capacity      = 0
  max_size              = 10
  min_size              = 0
  protect_from_scale_in = true

  launch_template {
    id      = aws_launch_template.train-launch-template.id
    version = "$Latest"
  }

  tags = [
    {
      key                 = "Project",
      value               = var.project_name_tag
      propagate_at_launch = true
    },
    {
      key                 = "Stage",
      value               = var.env_tag
      propagate_at_launch = true
    }
  ]
}

resource "aws_ecs_capacity_provider" "train-capacity-provider" {
  name = "${var.project_name}-${var.env}-train-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.train-autoscaling.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      maximum_scaling_step_size = 1
      minimum_scaling_step_size = 1
    }
  }
}

# Renders user_data.sh with the cluster name the ECS agent should join
data "template_file" "user_data" {
  template = file("${path.module}/user_data.sh")

  vars = {
    cluster_name = "${var.project_name}-${var.env}-train-ecs-cluster"
  }
}
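An Auto Scaling group can also be tied to specific subnets through `vpc_zone_identifier` rather than `availability_zones`. The sketch below shows that variant for comparison; `var.private_subnet_ids` and the resource name are hypothetical, and it is only an illustration of the argument, not a verified fix for the behaviour described here.

```hcl
# Hypothetical input: IDs of the private subnets the instances should launch into.
variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_autoscaling_group" "train_autoscaling_sketch" {
  # The subnets determine both the VPC and the availability zones,
  # so availability_zones is left out in this variant.
  vpc_zone_identifier   = var.private_subnet_ids
  desired_capacity      = 0
  max_size              = 10
  min_size              = 0
  protect_from_scale_in = true

  launch_template {
    id      = aws_launch_template.train-launch-template.id
    version = "$Latest"
  }
}
```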
Update 2 (AWS Console info):
Update 3:
After 30 minutes the task stops and this is the message shown (Task failed to start):
Update 4:
Logs from the container instance. ecs-agent.log
level=info time=2020-08-28T11:09:21Z msg="Loading configuration" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Amazon ECS agent Version: 1.44.1, Commit: 1f05fbf0" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2020-08-28T11:09:21Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2020-08-28T11:09:21Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:21Z msg="Loading state!" module=state_manager.go
level=info time=2020-08-28T11:09:23Z msg="Registering Instance with ECS" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Remaining mem: 7680" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registered container instance with cluster!" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registration completed successfully. I am running as 'arn:aws:ecs:eu-west-1:XXXXXXXXXXXXXXXX:container-instance/foqum-read-dev-train-ecs-cluster/95559f936f8d44de9373595009fcd588' in cluster 'foqum-read-dev-train-ecs-cluster'" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Beginning Polling for updates" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Initializing stats engine" module=engine.go
level=info time=2020-08-28T11:09:23Z msg="Event stream DeregisterContainerInstance start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXXX-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXXXX%3Acontainer-instance%2FXXXXXXXX-cluster%2F95559fXXXXXXde9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:09:23Z msg="NO_PROXY set:XXX.254.169.XXXX,XXXX.254.XXX.2,/var/run/docker.sock" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-a-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&clusterArn=XXXXX-ecs-cluster&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%XXXXXX%3Acontainer-instance%2FXXXXX-ecs-cluster%2F9XXXXX6f8d44de9373595009fcd588&dockerVersion=DockerVersion%3A+19.03.6-ce&sendCredentials=true&seqNum=1" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Connected to TCS endpoint" module=handler.go
level=info time=2020-08-28T11:09:23Z msg="Connected to ACS endpoint" module=acs_handler.go
level=info time=2020-08-28T11:20:04Z msg="TCS Websocket connection closed for a valid reason" module=handler.go
level=info time=2020-08-28T11:20:04Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXecs-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXX3Acontainer-instance%2FZZZXXXXX-ecs-cluster%2F95XXX936f8d44de9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:20:04Z msg="Connected to TCS endpoint" module=handler.go
ecs-init.log
2020-08-28T11:09:19Z [INFO] pre-start
2020-08-28T11:09:20Z [INFO] start
2020-08-28T11:09:20Z [INFO] No existing agent container to remove.
2020-08-28T11:09:20Z [INFO] Starting Amazon Elastic Container Service Agent