10
votes

I have implemented the Job Observer Pattern using SQS and ECS. Job descriptions are pushed to an SQS queue for processing. Job processing runs on an ECS Cluster within an Auto Scaling Group, as ECS Docker Tasks.

Each ECS Task does:

  1. Read message from SQS queue
  2. Execute job on data (~1 hour)
  3. Delete message
  4. Loop while there are more messages
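
For reference, a rough sketch of that loop in Python with boto3 (the queue URL and the process_job function are placeholders for my actual setup):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

    def process_job(body):
        """Placeholder for the ~1 hour job."""
        ...

    def run_worker():
        while True:
            # Long-poll for a single job description.
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=1,
                WaitTimeSeconds=20,
            )
            messages = resp.get("Messages", [])
            if not messages:
                break  # queue drained; let the Task exit

            msg = messages[0]
            # The queue's visibility timeout must exceed the job duration
            # (or be extended periodically) so the message isn't redelivered mid-job.
            process_job(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])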

I would like to scale the cluster in as each Instance runs out of work, eventually down to zero instances.

Looking at this similar post, the answers suggest that scale-in would need to be handled outside of the ASG in some way: Instances would self-scale-in, either by explicitly self-terminating or by toggling ASG Instance Protection off when there are no more messages.

This also doesn't handle the case of running multiple ECS Tasks on a single instance, since an individual Task shouldn't bring the instance down while other Tasks are still running on it.

Am I limited to self scale-in and only one Task per Instance? Any way to only terminate once all ECS Tasks on an instance have exited? Any other scale-in alternatives?

3
Can you check if the instance is executing a job with a simple application installed on your instances? For example by getting the CPU/memory utilization? – Mahdi

3 Answers

3
votes

You could use CloudWatch Alarms with Actions:

detect and terminate worker instances that have been idle for a certain period of time
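
A rough sketch of how that could look with boto3, assuming one job per instance and that low CPU utilization means idle (the instance ID, region, and thresholds are placeholders):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def add_idle_terminate_alarm(instance_id):
        # Terminate the instance if average CPU stays below 5% for 15 minutes.
        cloudwatch.put_metric_alarm(
            AlarmName=f"terminate-idle-{instance_id}",
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,
            Threshold=5.0,
            ComparisonOperator="LessThanThreshold",
            # Built-in EC2 alarm action that terminates the instance.
            AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
        )

Note that if the instance belongs to an Auto Scaling Group, the ASG will usually launch a replacement after such a termination unless the desired capacity is reduced as well.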

3
votes

I ended up using:

  • A Scale Out Policy that adds the same number of instances as there are pending SQS queue messages
  • A Scale In Policy that sets the instance count to zero once the SQS queue is empty
  • Enabling ASG Instance Protection at the start of each batch job and disabling it at the end

This restricts me to one batch job per instance, but it worked well for my scenario.
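
A rough sketch of the Instance Protection part with boto3, run from inside the batch job itself (the ASG name is a placeholder, and the instance ID is read from instance metadata; IMDSv2 would additionally require a session token):

    import boto3
    import urllib.request

    asg = boto3.client("autoscaling")
    ASG_NAME = "batch-worker-asg"  # placeholder

    def instance_id():
        # Query the EC2 instance metadata endpoint for this instance's ID.
        with urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
        ) as resp:
            return resp.read().decode()

    def set_scale_in_protection(protected):
        asg.set_instance_protection(
            AutoScalingGroupName=ASG_NAME,
            InstanceIds=[instance_id()],
            ProtectedFromScaleIn=protected,
        )

    # At the start of the batch job:
    #     set_scale_in_protection(True)
    # After the last message has been processed:
    #     set_scale_in_protection(False)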

3
votes

Another solution to this problem is the AWS Batch service, announced at the end of 2016.
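
AWS Batch manages the job queue, scheduling, and scaling of the compute environment for you, including scaling down to zero instances when there are no jobs. Instead of pushing messages to SQS, you submit jobs against a job queue and job definition. A rough sketch with boto3 (the queue and definition names are placeholders and are assumed to already exist):

    import boto3

    batch = boto3.client("batch")

    def submit_job(job_name, payload):
        # Submit one job; AWS Batch schedules it and scales the compute environment.
        return batch.submit_job(
            jobName=job_name,
            jobQueue="worker-job-queue",            # placeholder
            jobDefinition="worker-job-definition",  # placeholder
            containerOverrides={
                "environment": [{"name": "JOB_PAYLOAD", "value": payload}],
            },
        )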