Our team set up a Flink Session Cluster in our K8S cluster. We chose Flink Session Cluster rather than Job Cluster because we have a number of different Flink Jobs, so that we want to decouple the development and deployment of Flink from those of our jobs. Our Flink setup contains:
- Single JobManager as a K8S pod, no High Availability (HA) setup
- A number of TaskManagers, each as a K8S pod
And we develop our jobs in a separate repository and deploy to Flink cluster when there is code merged.
Now, we noticed that JobManager as a pod in K8S can be redeployed anytime by K8S. So, once it is redeployed, it loses all jobs. To solve this problem, we developed a script that keeps monitoring the jobs in Flink, if jobs not running, the script will resubmit the jobs to the cluster. Since it may take some time for the script to discover and resubmit the jobs, there is a small service break quite often, and we are thinking if this could be improved.
So far, we have some ideas or questions:
One possible solution could be: when the JobManager is (re)deployed, it will fetch the latest Jobs jar and run the jobs. This solution looks overall good. Still, since our jobs are developed in a separate repo, we need a solution for the cluster to notice the latest jobs when there are changes in the jobs, either JobManager keeps polling the latest jobs jar or Jobs repo deploys the latest jobs jar.
I see that Flink HA feature can store checkpoints/savepoints, but not sure if Flink HA can already handle this redeployment issue?
Does anyone have any comment or suggestion on this? Thanks!