I have a somewhat specific use case here. I need to auto-scale a distributed web app running on ECS Fargate. The catch is that all the nodes need to keep the same data in memory, so increasing the number of nodes does not help with memory pressure. Increasing load can therefore only be handled properly if the app scales both horizontally (adding nodes) and vertically (increasing each node's memory).

Horizontal auto-scaling is simple. AWS CDK provides nice high-level constructs for load-balanced Fargate tasks and makes it super easy to add more tasks to handle CPU load:

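from aws_cdk import aws_ecs_patterns

# Load-balanced Fargate service, starting at the smallest task size
service = aws_ecs_patterns.ApplicationLoadBalancedFargateService(
    self,
    'FargateService',
    cpu=256,
    memory_limit_mib=512,
    ...
)

# Horizontal scaling: add tasks (up to 5) to keep average CPU near 60%
scalable_target = service.service.auto_scale_task_count(max_capacity=5)
scalable_target.scale_on_cpu_utilization('CpuScaling', target_utilization_percent=60)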

What I'm looking for is the vertical scaling part. So far my best idea is the following:

  1. Create a CloudWatch alarm on the memory usage of the cluster that triggers above 60%.
  2. The alarm sends a message to an SNS topic, which triggers a Lambda function.
  3. The Lambda describes the current task definition and parses out its CPU and memory parameters. It then creates a new revision of the task definition with increased memory (and CPU if needed, because CPU and memory are not independent values in Fargate).
  4. Finally, the Lambda updates the service with the new task definition. This should trigger a rolling update and result in a cluster with the same number of nodes, each with more memory.
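
For illustration, steps 3 and 4 could be sketched roughly as below. This is a minimal sketch under assumptions, not a tested production Lambda: the size table only covers the smaller Fargate task sizes and should be checked against the current AWS documentation, the helper names (`next_fargate_size`, `scale_up_memory`) and the `CLUSTER_NAME` / `SERVICE_NAME` environment variables are made up for this example, and the step size (the next valid memory value) is just one possible policy.

```python
# Valid Fargate CPU units -> memory (MiB) combinations for the smaller sizes.
# Assumption: the 8 and 16 vCPU sizes are left out; verify against AWS docs.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

# Fields returned by describe_task_definition that register_task_definition
# does not accept as input.
READ_ONLY_KEYS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}


def next_fargate_size(cpu, memory):
    """Return the next (cpu, memory) pair with more memory, or None at the top.

    Keeps the current CPU when it supports a larger memory value and only
    moves to the next CPU size when it has to.
    """
    for candidate_cpu in sorted(FARGATE_SIZES):
        if candidate_cpu < cpu:
            continue
        bigger = [m for m in FARGATE_SIZES[candidate_cpu] if m > memory]
        if bigger:
            return candidate_cpu, bigger[0]
    return None


def scale_up_memory(ecs, cluster, service):
    """Register a larger task definition revision and roll the service onto it."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    task_def = ecs.describe_task_definition(
        taskDefinition=svc["taskDefinition"]
    )["taskDefinition"]

    new_size = next_fargate_size(int(task_def["cpu"]), int(task_def["memory"]))
    if new_size is None:
        return None  # already at the largest size in the table

    # Copy the definition, dropping read-only fields. Container-level memory
    # limits, if any are set, are left untouched here and may need updating too.
    params = {k: v for k, v in task_def.items() if k not in READ_ONLY_KEYS}
    params["cpu"], params["memory"] = str(new_size[0]), str(new_size[1])

    new_arn = ecs.register_task_definition(**params)["taskDefinition"]["taskDefinitionArn"]
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn


def handler(event, context):
    """Lambda entry point for the SNS-triggered alarm (names are hypothetical)."""
    import os
    import boto3  # imported lazily so the helpers above stay testable offline

    return scale_up_memory(
        boto3.client("ecs"), os.environ["CLUSTER_NAME"], os.environ["SERVICE_NAME"]
    )
```

Passing the ECS client in as an argument keeps the sizing logic testable without AWS access; for example, `next_fargate_size(256, 2048)` has to move up to the 512 CPU tier because 2048 MiB is the maximum for 256 CPU units.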

Do you think this could work? Is there any better solution? Any potential issues you can spot?

Thanks in advance for any ideas!

1 Answer

This seems like a reasonable approach and could work.

One issue might be that the increased memory demand isn't tracked in your IaC template. This could result in the service being "reset" to the minimal memory when you run a stack update that changes anything in the service.

To address this, you could create SSM parameters that hold the CPU and memory values and reference them in your template. Your Lambda would need to update them with the new values as well. That way, a stack update via CloudFormation/CDK deploys the current sizes rather than the original minimum, so it shouldn't immediately re-trigger the scale-up process.
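
On the Lambda side, keeping the parameters in sync could look something like the sketch below. The parameter names under `/myapp/fargate` and the `store_task_size` helper are hypothetical, chosen only for this example; the one AWS call used is the SSM `put_parameter` operation with `Overwrite=True`.

```python
def store_task_size(ssm, cpu, memory, prefix="/myapp/fargate"):
    """Write the current task CPU and memory to SSM so the template can read them.

    `ssm` is an SSM client (e.g. boto3.client("ssm")); `prefix` is a
    hypothetical parameter path, adjust it to your own naming scheme.
    """
    values = {
        f"{prefix}/cpu": str(cpu),
        f"{prefix}/memory": str(memory),
    }
    for name, value in values.items():
        # Overwrite=True updates the parameter if it already exists.
        ssm.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)
    return values
```

The Lambda would call this right after it updates the service, for example `store_task_size(boto3.client("ssm"), 512, 3072)`, and the template would then read the same two parameters when it defines the task size.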

You're only scaling up in terms of memory. Is there a scenario in which the memory demand decreases and you can scale down as well? That can be done via the same or a similar mechanism; it's just something to keep in mind.