My team is evaluating Flink for a few use cases involving a very large number of processing groups that we'd like to keep resource-isolated. Are there known major pitfalls or bottlenecks that folks would expect us to hit when running tens of thousands of jobs in a single cluster?
So far we've noticed that the JobManager slows down considerably after a few hundred jobs; the usual recommendation seems to be to split the single large cluster into multiple smaller clusters. Is that the best approach, or is there a way to get Flink to run reliably with a very large number of jobs in one cluster?