I am new to Flink and am planning to deploy a Flink session cluster on EKS with 1 JobManager and 5 TaskManagers (4 slots each). Different jobs will be submitted through the UI for different use cases.
Let's say I have submitted a stateful job (simple counter logic in a RichFlatMapFunction) backed by RocksDBStateBackend, with checkpointDataUri pointing to S3 and DbStoragePath pointing to a local file path. The job uses 8 slots in total, spread across two TaskManagers, and has been running fine for a day. My questions are:
1) My understanding of checkpointDataUri and DbStoragePath in RocksDBStateBackend is that checkpointDataUri stores the processed offset information in S3 (since I configured checkpointDataUri with an S3 prefix), and DbStoragePath contains all the state information used in the RichFlatMapFunction. So all the stateful information is stored in DbStoragePath, which is available locally only. Please correct me if this is wrong.
2) Let's say the EC2 instance hosting one of those TaskManagers (the one where 4 of the slots were in use) was restarted for some reason and took around 30 minutes to come back online. In this case, EKS will bring up a TaskManager on a new EC2 instance to match the replica count. Will the Flink JobManager then try to reschedule those 4 slots on a different TaskManager? If yes, how is the state that was stored locally on the EC2 instance recovered?
3) Is there any documentation or video on Flink failure recovery on EKS? I have seen the official documentation on how to deploy a Flink session cluster on EKS, but I could not find anything about failure recovery in that setup. Could someone please point me in the right direction?
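For reference, the kind of job described above can be sketched roughly as follows. This is only a minimal illustration, assuming a Flink 1.x release where the `RocksDBStateBackend(String, boolean)` constructor is still available (it is deprecated in newer versions). The bucket name, local path, socket source, checkpoint interval, and the class names `CounterJob` and `CountPerKey` are placeholders, not the actual values from the deployment.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class CounterJob {

    // Simple keyed counter: emits (key, running count) for every input element.
    public static class CountPerKey
            extends RichFlatMapFunction<String, Tuple2<String, Long>> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            // Keyed state handle; with RocksDB the working copy lives under the DB storage path.
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String key, Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();
            long next = (current == null ? 0L : current) + 1L;
            count.update(next);
            out.collect(Tuple2.of(key, next));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(8);            // 8 slots, as in the scenario above
        env.enableCheckpointing(60_000L); // placeholder interval: one checkpoint per minute

        // checkpointDataUri: placeholder S3 location for checkpoint data.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true);
        // DbStoragePath: placeholder local directory for RocksDB working files.
        backend.setDbStoragePath("/tmp/flink/rocksdb");
        env.setStateBackend(backend);

        env.socketTextStream("localhost", 9999) // placeholder source
           .keyBy(value -> value)
           .flatMap(new CountPerKey())
           .print();

        env.execute("stateful-counter");
    }
}
```

Building this requires the `flink-streaming-java` and `flink-statebackend-rocksdb` dependencies, and writing to `s3://` additionally requires one of Flink's S3 filesystem plugins on the cluster.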