
We have a Flink cluster managed by a different team. The cluster is shared between multiple jobs, so at any given time a TaskManager may have slots running operators from different jobs. I have a few questions:

  1. Is it advisable to share a cluster with other jobs in production?
  2. If one job fails, will it also kill TaskManager threads running tasks of another job?
  3. If we have no other option and must go with a shared cluster, what is the best way to handle exception scenarios so that other jobs are not killed when a TaskManager commits suicide with a FATAL error?

1 Answer

  1. I would recommend using Flink's per-job mode, where you have a dedicated Flink cluster per job. This gives you job isolation, and a misbehaving Flink job won't be able to take down your other jobs.

  2. If a job fails due to a task failure, then this won't affect other jobs being executed on the same TaskManager.

  3. If a TaskManager fails, then all tasks currently executing on it will fail. Consequently, every job that has at least one task running on this TaskManager will fail and then needs to be recovered. Currently, there is no way to enforce per-job isolation on a shared cluster. However, there is a JIRA issue that proposes to solve this problem by introducing job-level tags, which could be used to control the scheduling of tasks belonging to different jobs.
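
As a sketch of point 1, on YARN a dedicated per-job cluster can be spun up at submission time. The exact flags vary by Flink version (`-m yarn-cluster` is the legacy per-job mode; newer versions use `--target yarn-per-job` or application mode), and the jar path and parallelism below are placeholders:

```shell
# Hypothetical example: launch a dedicated per-job cluster on YARN.
# A fresh cluster is started for this job and torn down when it finishes,
# so a crashing TaskManager only takes down this one job.
./bin/flink run -m yarn-cluster -p 4 path/to/your-job.jar
```

With this setup, the question in point 3 largely disappears, because no two jobs ever share a TaskManager process.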
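
For the recovery mentioned in point 3, Flink can restart failed jobs automatically if a restart strategy is configured. A minimal `flink-conf.yaml` sketch (the keys are the classic restart-strategy options; the values are illustrative, not recommendations):

```yaml
# Retry a failed job up to 3 times, waiting 10 s between attempts.
# Combined with checkpointing, the job resumes from its last checkpoint.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

This does not prevent a TaskManager crash from failing co-located jobs, but it ensures the affected jobs are restarted without manual intervention.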