0
votes

We have more than five .NET Framework 4.8 MVC web applications running as Windows workloads in a Multi-AZ ECS cluster (EC2 launch type, Windows instances), exposed externally through an ALB. All of these applications have been working fine for quite some time. We now need to introduce auto scaling for some of them (out of 5, 3 need to scale out/in). We are thinking of using the two features below together to achieve this:

  1. ECS Service Auto Scaling, to scale each service's containers (task level).

  2. ECS cluster auto scaling using ECS capacity providers at the EC2 instance level, which provides room for the containers started by the tasks in step 1.

My question is: is this achievable, and is it the right approach for Windows containers? I am stressing Windows containers because ECS lacks many features on Windows compared to Linux containers. For example, we can't set a container memory soft limit (memoryReservation) and must specify a hard limit (memory) when configuring the task itself, which I think is a major limitation.

If this is not achievable, what are the options? We are not in a position to move to EKS now, and there is no Windows support for Fargate.


2 Answers

1
votes

Yes, this is achievable. We have an ECS cluster running half a dozen services using Windows Docker containers. As you mention, there are two types of auto-scaling: at the task level and at the container-instance level. Task-level auto-scaling is the easier of the two.

Container Instances

For the container instances auto-scaling group, you need the following resources:

  1. ECS Cluster (which you already have).
  2. A launch configuration for your container instances, using one of the Windows ECS Optimized AMIs.
  3. An auto-scaling group that uses your launch configuration to create new instances.
  4. Your scaling policies, which can be a combination of reservation target policies, scheduled policies, or capacity providers.

In the launch configuration, you need to register the new instances with your ECS cluster. We use UserData for this, and to configure some additional options (for example, to deal with the hard/soft memory limits in Windows that you mentioned):

<powershell>
# Keep 256 MiB per instance out of the memory pool that ECS can allocate to tasks
[Environment]::SetEnvironmentVariable("ECS_RESERVED_MEMORY", 256, "Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND", $TRUE, "Machine")
# Register this instance with the cluster
Initialize-ECSAgent -Cluster <your-cluster> -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
</powershell>

For the auto-scaling policies you have multiple options, so you have to choose what's best for your use case. Right now we scale using a memory reservation target of 80%, plus scheduled scaling actions at a couple of times during the day when we expect spikes, to make sure we have enough capacity. We have not started using capacity providers yet, as there have been some bugs and issues around them, so I'm giving them some more time to "mature" (I think many of the issues have been resolved by now; you can read more about this from this comment onwards).
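As a rough illustration of the memory reservation target, the sketch below builds the parameters for an EC2 Auto Scaling target tracking policy on the cluster's `MemoryReservation` CloudWatch metric. It only constructs a dictionary; the group name `windows-ecs-asg`, the cluster name `windows-cluster` and the policy name are made-up placeholders, and I'm assuming you would pass the result to boto3's `autoscaling` client as `put_scaling_policy(**policy)`.

```python
import json


def memory_reservation_policy(asg_name, cluster_name, target_percent=80.0):
    """Build keyword arguments for an EC2 Auto Scaling target tracking policy.

    The policy tracks the AWS/ECS MemoryReservation metric of one cluster.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{cluster_name}-memory-reservation",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "CustomizedMetricSpecification": {
                "MetricName": "MemoryReservation",
                "Namespace": "AWS/ECS",
                "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
                "Statistic": "Average",
            },
            "TargetValue": float(target_percent),
        },
    }


# Placeholder names, not taken from a real account.
policy = memory_reservation_policy("windows-ecs-asg", "windows-cluster")
print(json.dumps(policy, indent=2))
```

Scheduled actions would be configured separately; this only covers the reservation target.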

This is pretty much all you need to scale your container instances. You can start adding tasks to your cluster and your auto-scaling group should start adding more container instances.

Be aware that without capacity providers, there can be a situation where the group doesn't scale even though no instance has enough free memory or CPU to place a new task. This depends on your tasks' memory and CPU configuration and on your reservation target (for example, you have a task that requires 20% of the memory capacity of the entire cluster, your cluster is at 85% memory reservation, but your scaling policy won't scale until you reach 90%). This is the problem capacity providers are designed to solve. With our use case and task configurations we have never had an issue with the group not scaling, but we know we are overprovisioning our cluster to avoid it.
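To make that gap concrete, here is a small worked example with made-up numbers (four 8192 MiB instances, a 2048 MiB task, a scale-out threshold of 90%). It is only arithmetic, not how ECS actually schedules tasks: it checks whether any single instance has room for the task and compares the cluster-wide reservation with the threshold.

```python
def reservation_percent(capacity_mib, reserved_mib):
    """Cluster-wide memory reservation as a percentage of total capacity."""
    return 100.0 * sum(reserved_mib) / (capacity_mib * len(reserved_mib))


def can_place(task_mib, capacity_mib, reserved_mib):
    """True if at least one instance has enough unreserved memory for the task."""
    return any(capacity_mib - used >= task_mib for used in reserved_mib)


# Illustrative values only.
capacity = 8192                       # MiB per instance
reserved = [7000, 7000, 7000, 6800]   # MiB already reserved on each instance
task = 2048                           # MiB required by the new task
scale_out_at = 90.0                   # reservation % that triggers scale-out

pct = reservation_percent(capacity, reserved)
placeable = can_place(task, capacity, reserved)
# "Stuck": the task fits nowhere, yet the policy sees no reason to add instances.
stuck = (not placeable) and pct < scale_out_at

print(f"reservation={pct:.1f}% placeable={placeable} stuck={stuck}")
```

Lowering the target (or overprovisioning, as we do) shrinks the window in which this can happen, at the cost of idle capacity.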

The other problem is scaling in your container instances: you need to set the instances to drain somehow so that your tasks have time to end gracefully, which is not handled automatically by AWS. There is an open issue here which you can track. We solved this using a Lambda function and an Auto Scaling lifecycle hook, based on this GitHub repo from this comment.
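The general shape of such a drain handler can be sketched as follows. This is not the code from the linked repo, just a minimal outline under some assumptions: `ecs` and `autoscaling` are boto3 clients (or anything with the same methods), `event` is the lifecycle hook notification already parsed into a dict, the cluster has fewer than 100 container instances (no pagination), and something outside this function re-invokes it while tasks are still running.

```python
def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance description for an EC2 instance, or None."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]
    for instance in described:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance
    return None


def handle_terminating(ecs, autoscaling, cluster, event):
    """Drain the terminating instance; release the hook once no tasks remain.

    Returns "draining" while tasks are still running (the caller should retry
    later) or "continued" once the lifecycle action has been completed.
    """
    instance_id = event["EC2InstanceId"]
    instance = find_container_instance(ecs, cluster, instance_id)
    if instance is not None:
        if instance["status"] != "DRAINING":
            # Stop new placements and let ECS move tasks to other instances.
            ecs.update_container_instances_state(
                cluster=cluster,
                containerInstances=[instance["containerInstanceArn"]],
                status="DRAINING",
            )
        if instance["runningTasksCount"] > 0:
            return "draining"
    # Nothing left to wait for: allow Auto Scaling to terminate the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=event["LifecycleHookName"],
        AutoScalingGroupName=event["AutoScalingGroupName"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return "continued"
```

Passing the clients in as arguments keeps the logic testable with stand-in objects; the retry mechanism and the hook's heartbeat timeout are left out here and need to be sized to how long your tasks take to stop.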

Tasks

Scaling tasks is easier:

  1. You need to associate your service with your load balancer so that its tasks get registered/de-registered automatically in/from the ELB.
  2. You create a scaling policy for your service (for example, you can use ALBRequestCountPerTarget to scale based on the number of requests per target, or schedule auto-scaling actions too).
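A policy of that kind could look roughly like the sketch below, which builds the parameters for an Application Auto Scaling target tracking policy. The cluster, service and load balancer identifiers are placeholders, the target of 1000 requests per target and the cooldowns are arbitrary, and I'm assuming the result is passed to boto3's `application-autoscaling` client as `put_scaling_policy(**policy)` after the service has been registered as a scalable target.

```python
import json


def request_count_policy(cluster, service, alb_resource_label, target_requests=1000.0):
    """Build keyword arguments for a service-level target tracking policy."""
    return {
        "PolicyName": f"{service}-requests-per-target",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_requests),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
                "ResourceLabel": alb_resource_label,
            },
            "ScaleOutCooldown": 60,   # seconds, arbitrary
            "ScaleInCooldown": 300,   # seconds, arbitrary
        },
    }


# Placeholder identifiers, not from a real account.
policy = request_count_policy(
    "windows-cluster",
    "mvc-app",
    "app/my-alb/0123456789abcdef/targetgroup/mvc-app-tg/fedcba9876543210",
)
print(json.dumps(policy, indent=2))
```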

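In the task definition, you have to set your host port to 0 so that a free port on the container instance is assigned dynamically and registered with the ELB target group (read the docs for the hostPort here). This lets several copies of the same task run on one instance.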
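For reference, the relevant fragment of a container definition might look like this, written as a Python dict that mirrors the task definition JSON. The container name, image and container port 80 are placeholders for illustration.

```python
import json

# Fragment of a task definition's containerDefinitions entry (placeholder values).
container_definition = {
    "name": "mvc-app",                  # hypothetical container name
    "image": "example/mvc-app:latest",  # hypothetical image
    "portMappings": [
        {
            "containerPort": 80,  # port the app listens on inside the container
            "hostPort": 0,        # 0 = pick a free host port dynamically
            "protocol": "tcp",
        }
    ],
}

print(json.dumps(container_definition, indent=2))
```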

The major issue here is making sure your app is designed to work as a cluster (for example, using external session managers instead of in-memory session managers).

This is the gist of the process. There is a lot of documentation to read for each of these steps, but this should answer the questions you asked, point you in the right direction and help you avoid some of the issues we found along the way.

0
votes

To disable the hard memory restriction for Windows containers, set the "ECS_ENABLE_MEMORY_UNBOUNDED_WINDOWS_WORKAROUND" environment variable to true in the launch template UserData.

  1. AWS Launch template user data
<powershell>
[Environment]::SetEnvironmentVariable("ECS_RESERVED_MEMORY",256, "Machine") 
[Environment]::SetEnvironmentVariable("ECS_ENABLE_MEMORY_UNBOUNDED_WINDOWS_WORKAROUND", $true, "Machine") 
[Environment]::SetEnvironmentVariable("ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND", $true, "Machine") 
Initialize-ECSAgent -Cluster <clustername> -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
</powershell>
  2. In the task definition, skip the memory hard limit and provide only the soft limit (reservation), which will be ignored by the agent.

The agent version should be >= 1.32.1 (the newer the better).
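As a sketch of step 2, a container definition that sets only the soft limit might look like the dict below (mirroring the task definition JSON). The name, image and the 512 MiB value are placeholders; the small helper is just a sanity check I added for illustration, not part of any AWS API.

```python
def uses_soft_limit_only(container_definition):
    """True if the container sets memoryReservation but no hard memory limit."""
    return (
        "memoryReservation" in container_definition
        and "memory" not in container_definition
    )


# Placeholder values for illustration.
container_definition = {
    "name": "mvc-app",                  # hypothetical container name
    "image": "example/mvc-app:latest",  # hypothetical image
    "memoryReservation": 512,           # soft limit in MiB; no "memory" key
}

print(uses_soft_limit_only(container_definition))
```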