I have created a CloudFormation template that creates an ECS service and task definition, with autoscaling for the tasks. It is pretty basic: if MemoryUtilization for the tasks reaches a certain value, add 1 task, and vice versa. Here are the most relevant parts of the template.
EcsTd:
  Type: AWS::ECS::TaskDefinition
  DependsOn: LogGroup
  Properties:
    Family: !Sub ${EnvironmentName}-${PlatformName}-${Type}
    ContainerDefinitions:
      - Name: !Sub ${EnvironmentName}-${PlatformName}-${Type}
        Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${PlatformName}:${ImageVersion}
        Environment:
          - Name: APP_ENV
            Value: !If [isProd, "production", "staging"]
          - Name: APP_DEBUG
            Value: "false"
        ...
        PortMappings:
          - ContainerPort: 80
            HostPort: 0
        Memory: !Ref Memory
        Essential: true
EcsService:
  Type: AWS::ECS::Service
  DependsOn: WaitForLoadBalancerListenerRulesCondition
  Properties:
    ServiceName: !Sub ${EnvironmentName}-${PlatformName}-${Type}
    Cluster:
      Fn::ImportValue: !Sub ${EnvironmentName}-ECS-${Type}
    DesiredCount: !Ref DesiredCount
    TaskDefinition: !Ref EcsTd
    Role: "learningEcsServiceRole"
    LoadBalancers:
      - !If
        - isWeb
        - ContainerPort: 80
          ContainerName: !Sub ${EnvironmentName}-${PlatformName}-${Type}
          TargetGroupArn: !Ref AlbTargetGroup
        - !Ref AWS::NoValue
ServiceScalableTarget:
  Type: "AWS::ApplicationAutoScaling::ScalableTarget"
  Properties:
    MaxCapacity: !Ref MaxCount
    MinCapacity: !Ref MinCount
    ResourceId: !Join
      - /
      - - service
        - !Sub ${EnvironmentName}-${Type}
        - !GetAtt EcsService.Name
    RoleARN: arn:aws:iam::645618565575:role/learningEcsServiceRole
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
ServiceScaleOutPolicy:
  Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
  Properties:
    PolicyName: !Sub ${EnvironmentName}-${PlatformName}-${Type}-ScaleOutPolicy
    PolicyType: StepScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 1800
      MetricAggregationType: Average
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          ScalingAdjustment: 1
MemoryScaleOutAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub ${EnvironmentName}-${PlatformName}-${Type}-MemoryOver70PercentAlarm
    AlarmDescription: Alarm if memory utilization is greater than 70% of reserved memory
    Namespace: AWS/ECS
    MetricName: MemoryUtilization
    Dimensions:
      - Name: ClusterName
        Value: !Sub ${EnvironmentName}-${Type}
      - Name: ServiceName
        Value: !GetAtt EcsService.Name
    Statistic: Maximum
    Period: '60'
    EvaluationPeriods: '1'
    Threshold: '70'
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ServiceScaleOutPolicy
      - !Ref EmailNotification
...
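For completeness: the scale-in ("vice versa") side is among the parts elided above. It just mirrors the scale-out policy with a negative adjustment, triggered by a LessThanThreshold alarm, roughly like this untested sketch (the resource name is illustrative, and the matching alarm, omitted here, would be MemoryScaleOutAlarm's mirror with LessThanThreshold):

ServiceScaleInPolicy:
  Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
  Properties:
    PolicyName: !Sub ${EnvironmentName}-${PlatformName}-${Type}-ScaleInPolicy
    PolicyType: StepScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 1800
      MetricAggregationType: Average
      StepAdjustments:
        # Negative adjustment removes one task when the low-memory alarm fires
        - MetricIntervalUpperBound: 0
          ScalingAdjustment: -1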
So whenever a task starts to run out of memory, we'll add a new task. However, at some point we'll hit the limit of how much memory is available in our cluster.
For example, if the cluster consists of one t2.small instance, we have 2 GB of RAM. A small amount of that is used by the ECS agent running on the instance, so we have somewhat less than 2 GB available. If we set the task's memory to 512 MB, we can fit only 3 tasks in that cluster (3 × 512 MB = 1536 MB fits, but a 4th task would require 2048 MB, which is more than the instance has) unless we scale up the cluster.
By default an ECS cluster has a MemoryReservation metric that can be used for autoscaling the cluster. We would say that when MemoryReservation is more than 75%, add 1 instance to the cluster. That's relatively easy.
EcsCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Sub ${EnvironmentName}-${Type}
SgEcsHost:
  ...
ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    ImageId: !FindInMap [AWSRegionToAMI, !Ref 'AWS::Region', AMIID]
    InstanceType: !Ref InstanceType
    SecurityGroups: [ !Ref SgEcsHost ]
    AssociatePublicIpAddress: true
    IamInstanceProfile: "ecsInstanceRole"
    KeyName: !Ref KeyName
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        echo ECS_CLUSTER=${EnvironmentName}-${Type} >> /etc/ecs/ecs.config
ECSAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier:
      - Fn::ImportValue: !Sub ${EnvironmentName}-SubnetEC2AZ1
      - Fn::ImportValue: !Sub ${EnvironmentName}-SubnetEC2AZ2
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref AsgMinSize
    MaxSize: !Ref AsgMaxSize
    DesiredCapacity: !Ref AsgDesiredSize
    Tags:
      - Key: Name
        Value: !Sub ${EnvironmentName}-ECS
        PropagateAtLaunch: true
ScalePolicyUp:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'
MemoryReservationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '1'
    Statistic: Average
    Threshold: '75'
    AlarmDescription: Alarm if MemoryReservation is more than 75%
    Period: '60'
    AlarmActions:
      - !Ref ScalePolicyUp
      - !Ref EmailNotification
    # MemoryReservation is published in the AWS/ECS namespace with a
    # ClusterName dimension, not under AWS/EC2
    Namespace: AWS/ECS
    Dimensions:
      - Name: ClusterName
        Value: !Sub ${EnvironmentName}-${Type}
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation
However, that approach is wasteful: the alarm would fire when the third task is placed, so the new instance would sit empty until a 4th task is scheduled. That means we'd be paying for an instance we don't use.
I have noticed that when the ECS service tries to add a task to a cluster that does not have enough free memory, I get:
service Production-admin-worker was unable to place a task because no container instance met all of its requirements. The closest matching container-instance ################### has insufficient memory available.
In this example the template's parameters are:
EnvironmentName=Production
PlatformName=Admin
Type=worker
Is it possible to create an AWS::CloudWatch::Alarm that watches ECS cluster events and looks for that particular pattern? The idea would be to scale up the instance count in the cluster via the AWS::AutoScaling::AutoScalingGroup only when the AWS::ApplicationAutoScaling::ScalingPolicy adds tasks that don't have space in the cluster, and to scale the cluster down when MemoryReservation is less than 25% (meaning there are no tasks running there because the AWS::ApplicationAutoScaling::ScalingPolicy has removed them).
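To make the first half concrete, this is roughly the kind of rule I have in mind (an untested sketch: I am assuming ECS publishes a placement-failure service event such as SERVICE_TASK_PLACEMENT_FAILURE to CloudWatch Events, and since an event rule cannot invoke an Auto Scaling policy directly, ScaleClusterFunction stands in for a hypothetical Lambda that would increment the ASG's desired capacity):

PlacementFailureRule:
  Type: AWS::Events::Rule
  Properties:
    Description: React when the service cannot place a task for lack of memory
    EventPattern:
      source:
        - aws.ecs
      detail-type:
        - ECS Service Action
      detail:
        eventType:
          - ERROR
        eventName:
          - SERVICE_TASK_PLACEMENT_FAILURE
    Targets:
      # Hypothetical Lambda that bumps ECSAutoScalingGroup's DesiredCapacity by 1
      - Arn: !GetAtt ScaleClusterFunction.Arn
        Id: ScaleClusterOnPlacementFailure

The scale-down half seems simpler, since it can just mirror ScalePolicyUp and MemoryReservationAlarm with the comparison flipped, something like (resource names and cooldown illustrative):

ScalePolicyDown:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '300'
    ScalingAdjustment: '-1'
MemoryReservationLowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Alarm if MemoryReservation is below 25%
    Namespace: AWS/ECS
    MetricName: MemoryReservation
    Dimensions:
      - Name: ClusterName
        Value: !Sub ${EnvironmentName}-${Type}
    Statistic: Average
    Period: '60'
    EvaluationPeriods: '1'
    Threshold: '25'
    ComparisonOperator: LessThanThreshold
    AlarmActions:
      - !Ref ScalePolicyDown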