
I'm running Python Celery (a distributed task queue library) workers in an AWS ECS cluster (1 Celery worker per EC2 instance), but the tasks are long-running and NOT idempotent. This means that when an autoscaling scale-in event happens, i.e. when ECS terminates one of the containers running a worker because of low task load, any long-running tasks in progress on that worker are lost forever.

Does anyone have any suggestions on how to configure ECS autoscaling so that no tasks are terminated before completion? Ideally, an ECS scale-in event would initiate a warm shutdown of the Celery worker on the EC2 instance it wants to terminate, but only ACTUALLY terminate the EC2 instance once the worker has finished the warm shutdown, which happens after all its tasks have completed.

I also understand there is something called instance protection, which can be set programmatically and protects instances from being terminated in a scale-in autoscale event: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection-instance
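For reference, instance protection can be toggled from code with the Auto Scaling `set_instance_protection` API. A minimal boto3 sketch (the helper name and the injectable `client` parameter are mine; the ASG name and instance ID would come from your environment):

```python
def set_scale_in_protection(asg_name, instance_id, protected, client=None):
    """Enable or disable scale-in protection for one instance in an ASG."""
    if client is None:
        import boto3  # imported lazily so the helper is easy to stub in tests
        client = boto3.client("autoscaling")
    client.set_instance_protection(
        AutoScalingGroupName=asg_name,
        InstanceIds=[instance_id],
        ProtectedFromScaleIn=protected,
    )
```

A worker could enable protection while it has tasks in flight and disable it once it goes idle, though as noted below, knowing *when* it is idle is the hard part.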

However, I'm not aware of any Celery signal that fires after all tasks have finished during a warm shutdown, so I'm not sure how I'd programmatically know when to disable the protection anyway. And even if I found a way to disable the protection at the right moment, who would decide which worker gets sent the shutdown signal in the first place? Can EC2 be configured to perform a custom action on an instance during a scale-in event (like a warm Celery shutdown) instead of just terminating the EC2 instance?
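On that last question: EC2 Auto Scaling supports custom actions on scale-in via termination lifecycle hooks, which hold the instance in a `Terminating:Wait` state until you call `complete_lifecycle_action`. A hedged sketch of the release step (the function name and the `wait_for_warm_shutdown` callable are hypothetical placeholders for however you detect that the Celery worker has drained):

```python
def drain_then_release(asg_name, instance_id, hook_name,
                       wait_for_warm_shutdown, client=None):
    """Block until the worker has drained, then let the ASG terminate it."""
    # Hypothetical callable: returns once the Celery warm shutdown is done.
    wait_for_warm_shutdown()
    if client is None:
        import boto3  # lazy import keeps the helper stub-friendly
        client = boto3.client("autoscaling")
    client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```

The lifecycle hook's heartbeat timeout would need to be at least as long as your longest task, otherwise the instance is terminated mid-drain anyway.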


2 Answers


I think that when ECS scales in your tasks, it sends SIGTERM, waits 30 seconds (the default), and then kills your task's containers with SIGKILL.

I think that you can increase the time between the signals with this variable: ECS_CONTAINER_STOP_TIMEOUT.
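Assuming the EC2 launch type, that agent setting lives in `/etc/ecs/ecs.config` on the container instance. The 2-hour value below is just an illustrative guess; size it to your longest-running task:

```shell
# /etc/ecs/ecs.config
ECS_CONTAINER_STOP_TIMEOUT=2h
```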

That way, your Celery tasks can finish, and no new tasks will be sent to that Celery worker (Celery performs a warm shutdown after receiving the SIGTERM).

This answer might help you: https://stackoverflow.com/a/49564080/1011253


What we do at our company is not use ECS at all for this particular service, just "plain" EC2. We have an "autoscaling" task that runs every N minutes and, depending on the situation, scales the cluster up by M new machines (all configurable via AWS Parameter Store), so Celery effectively scales itself up and down. That task also sends a shutdown signal to every worker older than 10 minutes that is completely idle. When a Celery worker shuts down, the whole machine terminates: the worker powers off the machine from a `@worker_shutdown.connect` handler, and all these EC2 instances have a "terminate" shutdown policy. The cluster processes millions of tasks per day, some of them running for up to 12 hours...