
I'm running Python Celery (a distributed task queue library) workers in an AWS ECS cluster (1 Celery worker per EC2 instance), but the tasks are long-running and NOT idempotent. This means that when an autoscaling scale-in event happens, i.e. when ECS terminates one of the containers running a worker because of low task load, any long-running tasks in progress on that worker are lost forever.

Does anyone have any suggestions on how to configure ECS autoscaling so that no tasks are terminated before completion? Ideally, an ECS scale-in event would initiate a warm shutdown of the Celery worker on the EC2 instance it wants to terminate, but only ACTUALLY terminate the EC2 instance once the worker has finished the warm shutdown, which happens after all its tasks have completed.

I also understand there is something called instance protection, which can be set programmatically and protects instances from being terminated in a scale-in autoscale event: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection-instance
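For reference, instance protection can be toggled from code with the Auto Scaling `set_instance_protection` API. A minimal boto3 sketch (the helper name and the injectable `client` parameter are mine; the ASG name and instance ID would come from your environment):

```python
def set_scale_in_protection(asg_name, instance_id, protected, client=None):
    """Enable or disable scale-in protection for one instance in an ASG."""
    if client is None:
        import boto3  # imported lazily so the helper is easy to stub in tests
        client = boto3.client("autoscaling")
    client.set_instance_protection(
        AutoScalingGroupName=asg_name,
        InstanceIds=[instance_id],
        ProtectedFromScaleIn=protected,
    )
```

A worker could enable protection while it has tasks in flight and disable it once it goes idle, though as noted below, knowing *when* it is idle is the hard part.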

However, I'm not aware of any Celery signal that fires after all tasks have finished during a warm shutdown, so I'm not sure how I'd programmatically know when to disable the protection anyway. And even if I found a way to disable the protection at the right moment, who would decide which worker gets sent the shutdown signal in the first place? Can EC2 be configured to perform a custom action on an instance during a scale-in event (like a warm Celery shutdown) instead of just terminating the EC2 instance?
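On that last question: EC2 Auto Scaling supports custom actions on scale-in via termination lifecycle hooks, which hold the instance in a `Terminating:Wait` state until you call `complete_lifecycle_action`. A hedged sketch of the release step (the function name and the `wait_for_warm_shutdown` callable are hypothetical placeholders for however you detect that the Celery worker has drained):

```python
def drain_then_release(asg_name, instance_id, hook_name,
                       wait_for_warm_shutdown, client=None):
    """Block until the worker has drained, then let the ASG terminate it."""
    # Hypothetical callable: returns once the Celery warm shutdown is done.
    wait_for_warm_shutdown()
    if client is None:
        import boto3  # lazy import keeps the helper stub-friendly
        client = boto3.client("autoscaling")
    client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```

The lifecycle hook's heartbeat timeout would need to be at least as long as your longest task, otherwise the instance is terminated mid-drain anyway.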


2 Answers


I think that when ECS scales in your tasks, it sends SIGTERM, waits 30 seconds (the default), and then kills your task's containers with SIGKILL.

I think that you can increase the time between the signals with this variable: ECS_CONTAINER_STOP_TIMEOUT.
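Assuming the EC2 launch type, that agent setting lives in `/etc/ecs/ecs.config` on the container instance. The 2-hour value below is just an illustrative guess; size it to your longest-running task:

```shell
# /etc/ecs/ecs.config
ECS_CONTAINER_STOP_TIMEOUT=2h
```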

That way, your Celery tasks can finish, and no new tasks will be sent to that Celery worker (Celery performs a warm shutdown after receiving the SIGTERM).

This answer might help you: https://stackoverflow.com/a/49564080/1011253


What we do at our company is not use ECS at all for this particular service, just "plain" EC2. We have an "autoscaling" task that runs every N minutes and, depending on the situation, scales the cluster up by M new machines (all configurable via AWS Parameter Store), so Celery effectively scales itself up and down. That task also sends a shutdown signal to every worker older than 10 minutes that is completely idle. When a Celery worker shuts down, the whole machine terminates: the worker powers off the machine from a `@worker_shutdown.connect` handler, and all these EC2 instances have a "terminate" shutdown policy. The cluster processes millions of tasks per day, some of them running for up to 12 hours...