I'm running Python Celery (a distributed task queue library) workers in an AWS ECS cluster (one Celery worker per EC2 instance), but the tasks are long-running and NOT idempotent. This means that when an autoscaling scale-in event happens, i.e. when ECS terminates one of the containers running a worker because of low task load, the long-running tasks in progress on that worker are lost forever.
Does anyone have any suggestions on how to configure ECS autoscaling so no tasks are terminated before completion? Ideally, an ECS scale-in event would initiate a warm shutdown on the Celery worker in the EC2 instance it wants to terminate, but only actually terminate the EC2 instance once the Celery worker has finished the warm shutdown, which happens after all of its tasks have completed.
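For context, by "warm shutdown" I mean Celery's standard behavior on SIGTERM: the worker stops consuming new tasks and exits once in-flight tasks finish. A minimal sketch of what I imagine the container entrypoint would look like, forwarding a stop signal to the worker ("myapp" is a placeholder app name):

```python
# entrypoint.py -- minimal sketch; "myapp" is a placeholder app name.
# SIGTERM to Celery's main worker process triggers its standard warm
# shutdown: stop taking new tasks, exit once in-flight tasks finish.
import signal
import subprocess
import sys

worker = subprocess.Popen(["celery", "-A", "myapp", "worker", "--loglevel=info"])

def forward_sigterm(signum, frame):
    worker.send_signal(signal.SIGTERM)  # ask Celery for a warm shutdown

signal.signal(signal.SIGTERM, forward_sigterm)
sys.exit(worker.wait())
```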
I also understand there is something called instance protection, which can be set programmatically and protects instances from being terminated in a scale-in autoscaling event: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection-instance
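For what it's worth, here's roughly how I'd expect toggling that protection from the worker's own instance to look with boto3 (the ASG name "my-celery-asg" is a placeholder, and the instance-id lookup assumes IMDSv1 is enabled; IMDSv2 would need a session token):

```python
# Sketch: toggle scale-in protection for the instance this worker runs on.
import urllib.request

import boto3

ASG_NAME = "my-celery-asg"  # placeholder

def set_scale_in_protection(protected: bool) -> None:
    # Look up our own instance ID from the EC2 metadata endpoint
    # (IMDSv1-style request; IMDSv2 requires a session token).
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    boto3.client("autoscaling").set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=protected,
    )
```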
However, I'm not aware of any Celery signal that fires after all tasks have finished during a warm shutdown, so I'm not sure how I'd programmatically know when to disable the protection anyway. And even if I found a way to disable the protection at the right moment, who would manage which worker gets sent the shutdown signal in the first place? Can EC2 Auto Scaling be configured to perform a custom action on instances during a scale-in event (like a warm Celery shutdown) instead of just terminating the EC2 instance?
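For illustration, if there were a signal that fires at the right moment (worker_shutdown might be a candidate, though I haven't verified that it only fires after in-flight tasks have drained in a warm shutdown), the handler would be trivial:

```python
# Sketch: release scale-in protection once the worker finishes shutting down.
# worker_shutdown is a real Celery signal, but whether it fires *after* all
# in-flight tasks have drained in a warm shutdown is an assumption here.
from celery.signals import worker_shutdown

@worker_shutdown.connect
def release_scale_in_protection(**kwargs):
    set_scale_in_protection(False)  # helper from the boto3 sketch above
```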