
My team is evaluating Flink for a few use cases involving a very large number of processing groups that we'd like to keep resource-isolated. Are there known major pitfalls or bottlenecks that folks would expect to hit when running tens of thousands of jobs in a single cluster?

So far we've noticed that the JobManager slows down considerably after a few hundred jobs, and the recommendation we've seen is to split the single large cluster into multiple smaller clusters. Is that the best recommended approach, or is there a way to get Flink to run reliably with a very large number of jobs?


1 Answer


One job per cluster can be an appealing approach, but if the jobs are short-lived, the overhead of starting a cluster for each job can be significant. One advantage of this approach is security, as the jobs can be properly isolated from one another.
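As a rough sketch of what the one-job-per-cluster approach looks like in practice, Flink's application mode spins up a dedicated cluster per application. The deployment target and the jar name below are assumptions for illustration, not from the question:

```shell
# Launch a dedicated per-job cluster in application mode (YARN assumed here;
# Kubernetes application mode is analogous). myJob.jar is a hypothetical artifact.
./bin/flink run-application \
    -t yarn-application \
    -Djobmanager.memory.process.size=2048m \
    -Dtaskmanager.memory.process.size=4096m \
    ./myJob.jar
```

The cluster lives only as long as the application, which is where the startup overhead for short-lived jobs comes from.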

Going in the other direction, i.e., running many jobs in a single cluster: as the number of task managers and jobs scales up, coordinating all of the checkpointing activity in the cluster can become a bottleneck (assuming checkpointing is enabled).
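If you do pack many jobs into one cluster, you can reduce checkpoint-coordination pressure somewhat through configuration. A sketch of the relevant `flink-conf.yaml` knobs, with illustrative values rather than recommendations:

```yaml
# Checkpoint less frequently to lower coordination load (illustrative values).
execution.checkpointing.interval: 5min
# Enforce a pause between checkpoints so they can't pile up back-to-back.
execution.checkpointing.min-pause: 1min
# Avoid overlapping checkpoints for the same job.
execution.checkpointing.max-concurrent-checkpoints: 1
```

This only mitigates the problem; it doesn't change the fundamental scaling behavior of a single JobManager coordinating thousands of jobs.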