Running k8s statefulset on AWS spot instance

Question

In the past we ran some stateful applications (e.g. databases) on AWS on-demand/reserved EC2 instances. We are now considering moving those apps to k8s StatefulSets with PVCs.

Is it recommended to run a k8s StatefulSet on spot instances to reduce cost? Since we can use kube-spot-termination-notice-handler to taint the node and move the pods elsewhere before the spot instance is terminated, it looks like there should be no problem as long as the StatefulSet has multiple replicas to keep the service from being interrupted.

Clorichel Clorichel · Accepted Answer · 2018-11-30T15:49:31

There is probably no single answer to this question: it really depends on the workload you want to run and on how tolerant your application is to failures. When a spot instance is about to be interrupted (higher bidder, capacity no longer available...), a well-configured StatefulSet or any other appropriate controller will indeed do its job as expected, usually pretty quickly (seconds).

But be aware that it is wrong to assume that:

you'll receive an interruption notice each and every time,
and that the notice will always arrive a full 2 minutes before a spot instance is interrupted

See the AWS documentation itself, https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#using-spot-instances-managing-interruptions, which states: "[...] it is possible that your Spot Instance is terminated before the warning can be made available".
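
To make the point concrete, here is a minimal Python sketch of how a handler could work out how much time is really left once a notice shows up, instead of assuming a full 2 minutes. It assumes the notice body has already been fetched from the instance metadata path /latest/meta-data/spot/instance-action and looks like the JSON example in the AWS docs; the function name seconds_until_interruption is made up for this illustration, and fetching the body (and handling the case where no notice ever arrives) is left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Return the seconds left before the scheduled action, or None if the body is unusable.

    `body` is assumed to be the JSON served at
    /latest/meta-data/spot/instance-action, for example:
    {"action": "terminate", "time": "2017-09-18T08:22:00Z"}
    `now` must be a timezone-aware datetime.
    """
    try:
        notice = json.loads(body)
        deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    except (ValueError, KeyError, TypeError):
        # Malformed or unexpected body: the caller should not count on any grace period.
        return None
    deadline = deadline.replace(tzinfo=timezone.utc)
    # Never report a negative budget: the notice may arrive late.
    return max(0.0, (deadline - now).total_seconds())
```

Whatever drains the node should use the returned value as its time budget, and the application still has to survive the case where the instance disappears with no notice at all.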

So the real question is: how tolerant is your application to the unplanned removal of resources?

If you have just 2 EC2 instances running hundreds of pods each, you'll most likely NOT want to use spot instances: your service will be highly degraded if one of the 2 instances is interrupted, until a new one spins up or k8s redispatches the load (assuming the other instance is big enough). Hundreds of EC2 instances with a few pods each and slightly over-provisioned autoscaling rules? You might as well just go for it and enjoy the spot cost savings!
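
The same reasoning can be written as a back-of-the-envelope check. This is only a rough sketch under simplifying assumptions: all nodes are the same size, load can be spread freely across them, and the function name survives_node_loss is hypothetical.

```python
def survives_node_loss(node_count: int, utilization: float, lost_nodes: int = 1) -> bool:
    """Return True if the remaining nodes can absorb the load of the lost ones.

    `utilization` is the average fraction of each node's capacity in use (0.0 to 1.0).
    All nodes are assumed to have the same capacity.
    """
    remaining = node_count - lost_nodes
    if remaining <= 0:
        return False
    # Total load, expressed in "node equivalents", must fit on the nodes that are left.
    return node_count * utilization <= remaining


# 2 big nodes at 70% each: losing one leaves 1.4 nodes of load on 1 node.
print(survives_node_loss(2, 0.7))
# 100 small nodes at 90% each: losing one leaves 90 nodes of load on 99 nodes.
print(survives_node_loss(100, 0.9))
```

It ignores scheduling constraints such as PVC zone affinity and pod anti-affinity, which matter a lot for a StatefulSet, so treat it as a first filter only.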

You'll also want to double-check your clients' behaviour: assuming you run an API on k8s and pods are stopped before responding, make sure your clients handle that scenario and fire another request, or at the very least fail gracefully.
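
On the client side, "fire another request" could look like the following Python sketch. It assumes the request is safe to repeat (idempotent) and that a pod stopping mid-request surfaces as a ConnectionError; call_with_retry and its parameters are names invented for this example, not part of any library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on ConnectionError with exponential backoff.

    Re-raises the last ConnectionError once `attempts` calls have failed,
    so the caller can fail gracefully.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next try, but not after the final failure.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Here `request` would wrap the actual HTTP call; the `sleep` parameter exists only so the backoff can be replaced in tests.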

But you spoke of databases: so how about replication? Is it fast and automated? Are there enough replicas of the data to tolerate the loss of 1 to n replicas?

In other words: it just needs some good planning and thorough testing at scale. The good news is that it's easy to do: run a load test and deliberately crash an instance, and the answers will meet you there!
