I just successfully dealt with the problem of scaling in an auto scaling group whose nodes host long-running jobs, using the relatively recent lifecycle hook feature.
The problem with trying to choose an idle node to terminate, in my case, was that the process choosing the idle node races against the processes submitting work to the nodes. It's better to use a strategy in which any node can be terminated, but termination happens gracefully so that no work is lost. You can then use all of the standard auto scaling policy machinery to manage scale-in and scale-out.
The termination lifecycle hook allows the user (or a process) to perform actions on the node after the auto scaling group has placed it into an intermediate state (labeled Terminating:Wait). The user (or process) is then responsible for completing the lifecycle action via an AWS API call, at which point the EC2 instance is shut down.
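For reference, creating such a hook looks roughly like the following. This is only a sketch: the group, hook, queue, and role names are made up, and I'm using boto3-style calls just to show the shape of the parameters; as I note below, SDK support hasn't caught up yet, so in practice the same parameters go through the CLI or the raw API.

```python
import boto3  # assumption: a client that exposes the lifecycle-hook API (see the SDK caveat below)

autoscaling = boto3.client("autoscaling")

# Hook the terminate transition; group, hook, queue, and role names are placeholders.
autoscaling.put_lifecycle_hook(
    AutoScalingGroupName="worker-asg",
    LifecycleHookName="graceful-drain",
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    NotificationTargetARN="arn:aws:sqs:us-east-1:123456789012:asg-termination",
    RoleARN="arn:aws:iam::123456789012:role/asg-notify-sqs",
    HeartbeatTimeout=900,      # seconds the instance waits in Terminating:Wait before DefaultResult applies
    DefaultResult="CONTINUE",  # what happens if nobody completes the lifecycle action in time
)
```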
The way I set this up, in short, is:
- Create a role that allows auto scaling to post a message to an SQS queue.
- Create an SQS queue to receive the termination messages (the queue and role setup are sketched after this list).
- Create a monitor script that runs as a service on each node. My script is a simple event-driven state machine that transitions in sequence from MONITORING (polling SQS for a termination message addressed to the node) to DRAINING (polling a job queue until no work is running on the node) to TERMINATED (making the complete-lifecycle call); a stripped-down version is sketched below.
- Standard configuration for event-driven AWS auto scaling: CloudWatch alarms and the scaling policies for scale-in and scale-out (also sketched below).
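To make the first two bullets concrete, here's a sketch of the notification plumbing: the SQS queue and the role that auto scaling assumes to post to it. The calls are boto3-style, and the names, account ID, and region in the ARN are placeholders.

```python
import json

import boto3  # assumption: boto3-style clients; names, account ID, and region are placeholders

iam = boto3.client("iam")
sqs = boto3.client("sqs")

# The queue that receives the Terminating:Wait notifications.
queue_url = sqs.create_queue(QueueName="asg-termination")["QueueUrl"]

# A role that the auto scaling service can assume...
assume_role = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "autoscaling.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="asg-notify-sqs",
                AssumeRolePolicyDocument=json.dumps(assume_role))

# ...with just enough permission to post the lifecycle notification to that queue.
notify_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["sqs:SendMessage", "sqs:GetQueueUrl"],
        "Resource": "arn:aws:sqs:us-east-1:123456789012:asg-termination",
    }],
}
iam.put_role_policy(RoleName="asg-notify-sqs",
                    PolicyName="post-lifecycle-notifications",
                    PolicyDocument=json.dumps(notify_policy))
```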
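The monitor service is the interesting part, so here's a stripped-down sketch of the MONITORING, DRAINING, TERMINATED sequence. Assumptions: `node_is_idle()` stands in for whatever check your job queue needs, the queue URL and instance ID are placeholders (in practice the instance ID comes from the instance metadata service), and the complete-lifecycle call is shown in boto3 style even though, as noted below, you may have to make that particular call via the CLI or the raw API for now.

```python
import json
import time

import boto3  # assumption: boto3-style clients; see the SDK caveat below for the lifecycle call

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/asg-termination"  # placeholder
INSTANCE_ID = "i-0123456789abcdef0"  # in practice, read this from the instance metadata service

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")


def node_is_idle():
    """Stand-in for the job-queue check: return True once no work is running on this node."""
    return True


def wait_for_termination_notice():
    """MONITORING: long-poll SQS until a termination notice addressed to this instance shows up."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            if (body.get("EC2InstanceId") == INSTANCE_ID and
                    body.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING"):
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
                return body
            # Messages for other instances (or test notifications) are left alone so they
            # become visible again for whichever node they belong to.


def drain():
    """DRAINING: wait until this node has finished all of its work."""
    while not node_is_idle():
        time.sleep(30)


def main():
    notice = wait_for_termination_notice()
    drain()
    # TERMINATED: tell the auto scaling group it may finish shutting this instance down.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=notice["LifecycleHookName"],
        AutoScalingGroupName=notice["AutoScalingGroupName"],
        LifecycleActionToken=notice["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )


if __name__ == "__main__":
    main()
```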
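The last bullet is the usual alarm-plus-policy pairing; a sketch of the scale-in half follows (metric, thresholds, and names are placeholders, and the scale-out side is the mirror image with a positive adjustment and a greater-than comparison).

```python
import boto3  # assumption: boto3-style clients; metric, thresholds, and names are placeholders

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Scale-in policy: remove one instance when the alarm below fires.
policy_arn = autoscaling.put_scaling_policy(
    AutoScalingGroupName="worker-asg",
    PolicyName="scale-in-one",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=300,
)["PolicyARN"]

# Alarm on whatever metric reflects idle capacity (here, backlog in a hypothetical work queue).
cloudwatch.put_metric_alarm(
    AlarmName="worker-asg-low-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "work-queue"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[policy_arn],
)
```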
One hindrance to this approach is that lifecycle hook management isn't supported yet in the SDKs (boto, at least, doesn't support it AFAIK), nor are there CloudFormation resources for the hooks.
The relevant AWS documentation is here:
http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroupLifecycle.html