I use Kubernetes (OpenShift) to deploy many microservices, and I would like to use it to deploy some of my Flink jobs as well. These jobs are critical: some are stateless and must process every record exactly once, while others are stateful and look for patterns in the stream or react to time. None of them can tolerate long downtime or frequent shutdowns (for example, due to programming errors or the way Flink quits).
The docs mostly lean toward deploying Flink jobs on k8s as a Job Cluster. But what is a practical approach to doing that?
- Although k8s can restart a failed Flink pod, how can Flink restore its state to recover?
- Can the Flink pod be replicated, i.e. run as more than one replica? How do the JobManager and TaskManager work when two or more pods exist? If it cannot be replicated, why not? Are there other approaches?